Mizan Leaderboard
Performance comparison for Turkish embedding models
🔍 How to Use the Leaderboard:
- Search: Use the search box to find specific models
- Color Coding: Scores are color-coded from red (low) to green (high)
- Sorting: Click on column headers to sort by different metrics
- Rankings: Models are ranked by Mean (Task) score
📊 Performance Insights:
- Top Performers: Models with Mean (Task) > 65 show strong overall performance
- Specialized Models: Some models excel in specific tasks (e.g., retrieval vs classification)
- Model Size vs Performance: Larger models generally perform better, though there are exceptions
🚀 Submit Model for Evaluation
Submit your Turkish embedding model for evaluation on the MTEB Turkish benchmark. Authentication with Hugging Face is required to submit evaluations.
📋 Evaluation Process:
- Sign In: First, sign in with your Hugging Face account using the button above
- Submit Request: Fill out the form with your model details and email
- Admin Review: Your request will be reviewed by administrators
- Evaluation: If approved, your model will be evaluated on the MTEB Turkish benchmark
- Results: You'll receive email notifications and results will appear on the leaderboard
⚠️ Important Notes:
- Authentication Required: You must be logged in with Hugging Face to submit evaluations
- You'll receive email updates about your request status
- Make sure your model is publicly available on Hugging Face so it can be loaded for evaluation (see the local-run sketch below)
- A valid email address is required for receiving results
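If you want to sanity-check your model before submitting, the snippet below is a minimal sketch of a local run with the `mteb` library (v1.x, which Mizan is based on). The model name and output folder are illustrative placeholders, and the language filter is an approximation: Mizan's exact task list is fixed by the benchmark.

```python
# Minimal sketch of a local MTEB run (mteb v1.x, which Mizan is based on).
# The model name and output folder are placeholders; the language filter
# is approximate, since Mizan fixes its own task selection.
import mteb

# Load any public Hugging Face embedding model.
model = mteb.get_model("intfloat/multilingual-e5-large")

# Select tasks that include Turkish (ISO 639-3 code "tur").
tasks = mteb.get_tasks(languages=["tur"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```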
📊 MTEB Turkish Dataset Overview
MTEB Turkish Task Details
| Task Type | Description | Domain | Samples |
| --- | --- | --- | --- |
| Retrieval | Turkish FAQ retrieval task | FAQ/QA | ~135K |
| Retrieval | Turkish question answering retrieval | QA | ~10K |
| Retrieval | Historical Turkish document retrieval | Historical | ~1.4K |
| Retrieval | Multilingual knowledge QA retrieval | Knowledge QA | ~10K |
| Classification | Intent classification for Turkish | Intent | ~11K |
| Classification | Scenario classification for Turkish | Scenario | ~11K |
| Classification | Multilingual sentiment classification | Sentiment | ~4.5K |
| Classification | SIB200 language identification | Language ID | ~700 |
| Classification | Turkish movie review sentiment | Movies | ~8K |
| Classification | Turkish product review sentiment | Products | ~4.8K |
| Clustering | SIB200 clustering task | Language ID | ~1K |
| PairClassification | Turkish natural language inference | NLI | ~1.4K |
| PairClassification | Enhanced Turkish NLI task | NLI | ~1.4K |
| STS | Turkish semantic textual similarity | STS | ~400 |
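As a rough cross-check of the table above, the sketch below tallies MTEB tasks covering Turkish by task type. The language filter is an assumption and may return a broader set than Mizan's 14 tasks.

```python
# Sketch: tally MTEB tasks that cover Turkish by task type, roughly
# mirroring the table above. The filter may return more than Mizan's
# 14 tasks, since the benchmark fixes its own selection.
from collections import Counter

import mteb

tasks = mteb.get_tasks(languages=["tur"])
by_type = Counter(task.metadata.type for task in tasks)

for task_type, count in sorted(by_type.items()):
    print(f"{task_type}: {count} task(s)")
```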
📈 Task Distribution:
By Task Type:
- Classification: 6 tasks (sentiment, intent, scenario, language identification)
- Retrieval: 4 tasks (FAQ, QA, historical documents, knowledge QA)
- Pair Classification: 2 tasks (natural language inference)
- Clustering: 1 task (language clustering)
- STS: 1 task (semantic textual similarity)
By Domain:
- Sentiment Analysis: Movie and product reviews
- Question Answering: FAQ, reading comprehension, and knowledge QA
- Intent/Scenario: Conversational AI applications
- Language Tasks: NLI, STS, clustering
- Multilingual: Cross-lingual evaluation capabilities
Dataset Statistics Summary
| Statistic | Value | Notes |
| --- | --- | --- |
| Total Tasks | 14 tasks | Comprehensive evaluation across domains |
| Total Samples | ~190K samples | Large-scale evaluation dataset |
| Task Types | 5 types | Classification, Retrieval, STS, NLI, Clustering |
| Languages | Turkish + Multilingual | Focus on Turkish with multilingual support |
| Avg. Tokens per Sample | ~150 tokens | Varies by task type and domain |
🎯 Evaluation Methodology:
Scoring:
- Each task uses task-specific metrics (accuracy, F1, recall@k, etc.)
- Mean (Task): Direct average of all individual task scores
- Mean (TaskType): Average of task category means (both aggregates are worked through in the sketch below)
- Individual Categories: Performance in each task type
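The two aggregates differ only in how scores are grouped, as this worked sketch with hypothetical scores shows (the values below are made up; the task counts per type match the benchmark's 14 tasks across 5 types):

```python
# Worked sketch of the two aggregate scores, with made-up example values.
# Mean (Task) averages every task score directly; Mean (TaskType) first
# averages within each task type, then averages those category means.
from statistics import mean

# Hypothetical per-task scores grouped by task type (not real results).
scores = {
    "Classification": [70.1, 65.3, 68.0, 61.2, 72.4, 66.5],
    "Retrieval": [58.9, 62.3, 55.0, 60.1],
    "PairClassification": [74.2, 75.8],
    "Clustering": [41.0],
    "STS": [69.7],
}

all_scores = [s for task_scores in scores.values() for s in task_scores]
mean_task = mean(all_scores)                             # direct average of 14 tasks
mean_task_type = mean(mean(v) for v in scores.values())  # average of 5 category means

print(f"Mean (Task):     {mean_task:.2f}")
print(f"Mean (TaskType): {mean_task_type:.2f}")
```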
Model Ranking:
- Primary ranking by Mean (Task) score
- Correlation metrics provide additional insights
- Task-specific performance shows model strengths
Quality Assurance:
- Standardized evaluation protocols
- Consistent preprocessing across tasks
- Multiple metrics per task for robustness
📊 Metrics Explanation:
- Mean (Task): Average performance across all individual tasks
- Mean (TaskType): Average performance by task categories
- Classification: Performance on Turkish classification tasks
- Clustering: Performance on Turkish clustering tasks
- Pair Classification: Performance on pair classification tasks (like NLI)
- Retrieval: Performance on information retrieval tasks
- STS: Performance on Semantic Textual Similarity tasks
- Correlation: Weighted average of correlation metrics for the NLI and STSB datasets (see the sketch after this list)
- Parameters: Number of model parameters
- Embed Dim: Embedding dimension size
- Max Seq Length: Maximum sequence length the model can process (0 means no fixed limit)
- Vocab Size: Size of the model's vocabulary
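The exact weighting scheme behind the Correlation column is not spelled out here; the sketch below assumes weights proportional to dataset size purely for illustration, with hypothetical names and values.

```python
# Hedged sketch of a weighted correlation average. Weights proportional
# to dataset size are an assumption; the leaderboard's actual scheme may
# differ. All names and values below are hypothetical.

def weighted_average(values: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(values[k] * weights[k] / total for k in values)

correlations = {"nli": 0.78, "stsb": 0.81}  # e.g. Spearman correlations
sizes = {"nli": 1400, "stsb": 400}          # approximate sample counts

print(f"Correlation: {weighted_average(correlations, sizes):.3f}")
```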
📖 About Mizan:
This leaderboard presents results from the Mizan benchmark, which evaluates embedding models on Turkish language tasks across multiple domains including:
- Text classification and sentiment analysis
- Information retrieval and search
- Semantic textual similarity
- Text clustering and pair classification
🚀 Submit Your Model:
Use the Submit tab to submit your Turkish embedding model for evaluation. Your request will be reviewed by administrators and you'll receive email notifications about the progress.
Contact:
For any questions or feedback, please contact info@newmind.ai
Links:
- GitHub: mteb/mteb (Mizan is currently based on MTEB v1.38.51; MTEB v2.0.0 support coming soon)