Mizan Leaderboard
Performance comparison for Turkish embedding models
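|  Rank  |  Mean (Task)  |  Mean (TaskType)  |  Classification  |  Clustering  |  Pair Classification  |  Retrieval  |  STS  |  Correlation  |  Parameters  |  Embed Dim  |  Max Seq Length  |  Vocab Size  | 
|  10  |  69.39  |  63.51  |  75.68  |  35.26  |  78.88  |  57.89  |  69.83  |  0.61  |  567M  |  1024  |  131072  |  250002  | 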
🔍 How to Use the Leaderboard:
- Search: Use the search box to find specific models
- Color Coding: Scores are color-coded from red (low) to green (high)
- Sorting: Click on column headers to sort by different metrics
- Rankings: Models are ranked by Mean (Task) score
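The red-to-green color coding can be pictured as a linear interpolation over the score range. The snippet below is only a sketch under that assumption: the function name `score_to_hex`, the 0 to 100 range, and the exact colors are hypothetical and not taken from the leaderboard's own code.

```python
def score_to_hex(score, low=0.0, high=100.0):
    """Map a score to a hex color between red (low) and green (high).

    Hypothetical helper: the real leaderboard may use a different palette
    or scale its colors per column.
    """
    # Clamp the normalized score into [0, 1] so out-of-range values
    # still produce a valid color.
    t = max(0.0, min(1.0, (score - low) / (high - low)))
    red = round(255 * (1 - t))
    green = round(255 * t)
    return f"#{red:02x}{green:02x}00"


lowest = score_to_hex(0)     # pure red
highest = score_to_hex(100)  # pure green
```

Scaling per column instead (using each column's own minimum and maximum as `low` and `high`) would make differences between closely ranked models easier to see.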
📊 Performance Insights:
- Top Performers: Models with Mean (Task) > 65 show strong overall performance
- Specialized Models: Some models excel in specific tasks (e.g., retrieval vs classification)
- Model Size vs Performance: Larger models generally perform better, though there are exceptions
🚀 Submit Model for Evaluation
Submit your Turkish embedding model for evaluation on the MTEB Turkish benchmark. Authentication with Hugging Face is required to submit evaluations.
📋 Evaluation Process:
- Sign In: First, sign in with your Hugging Face account using the button above
- Submit Request: Fill out the form with your model details and email
- Admin Review: Your request will be reviewed by administrators
- Evaluation: If approved, your model will be evaluated on the MTEB Turkish benchmark
- Results: You'll receive email notifications and results will appear on the leaderboard
⚠️ Important Notes:
- Authentication Required: You must be logged in with Hugging Face to submit evaluations
- You'll receive email updates about your request status
- Make sure your model is publicly available on Hugging Face
- A valid email address is required to receive results
📊 MTEB Turkish Dataset Overview
MTEB Turkish Task Details
|  Task Type  |  Description  |  Domain  |  Samples  | 
|  Retrieval  |  Turkish FAQ retrieval task  |  FAQ/QA  |  ~135K  | 
|  Retrieval  |  Turkish question answering retrieval  |  QA  |  ~10K  | 
|  Retrieval  |  Historical Turkish document retrieval  |  Historical  |  ~1.4K  | 
|  Retrieval  |  Multilingual knowledge QA retrieval  |  Knowledge QA  |  ~10K  | 
|  Classification  |  Intent classification for Turkish  |  Intent  |  ~11K  | 
|  Classification  |  Scenario classification for Turkish  |  Scenario  |  ~11K  | 
|  Classification  |  Multilingual sentiment classification  |  Sentiment  |  ~4.5K  | 
|  Classification  |  SIB200 language identification  |  Language ID  |  ~700  | 
|  Classification  |  Turkish movie review sentiment  |  Movies  |  ~8K  | 
|  Classification  |  Turkish product review sentiment  |  Products  |  ~4.8K  | 
|  Clustering  |  SIB200 clustering task  |  Language ID  |  ~1K  | 
|  PairClassification  |  Turkish natural language inference  |  NLI  |  ~1.4K  | 
|  PairClassification  |  Enhanced Turkish NLI task  |  NLI  |  ~1.4K  | 
|  STS  |  Turkish semantic textual similarity  |  STS  |  ~400  | 
📈 Task Distribution:
By Task Type:
- Classification: 6 tasks (sentiment, intent, scenario, language identification)
- Retrieval: 4 tasks (FAQ, QA, historical documents, knowledge QA)
- Pair Classification: 2 tasks (natural language inference)
- Clustering: 1 task (language clustering)
- STS: 1 task (semantic textual similarity)
By Domain:
- Sentiment Analysis: Movie and product reviews
- Question Answering: FAQ, reading comprehension, and knowledge QA
- Intent/Scenario: Conversational AI applications
- Language Tasks: NLI, STS, clustering
- Multilingual: Cross-lingual evaluation capabilities
Dataset Statistics Summary
|  Metric  |  Value  |  Notes  | 
|  Total Tasks   |  14 tasks   |  Comprehensive evaluation across domains   | 
|  Total Samples   |  ~190K samples   |  Large-scale evaluation dataset   | 
|  Task Types   |  5 types   |  Classification, Retrieval, STS, Pair Classification (NLI), Clustering   | 
|  Languages   |  Turkish + Multilingual   |  Focus on Turkish with multilingual support   | 
|  Avg. Tokens per Sample   |  ~150 tokens   |  Varies by task type and domain   | 
🎯 Evaluation Methodology:
Scoring:
- Each task uses task-specific metrics (accuracy, F1, recall@k, etc.)
- Mean (Task): Direct average of all individual task scores
- Mean (TaskType): Average of task category means
- Individual Categories: Performance in each task type
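The difference between the two means is easiest to see in code. The sketch below assumes each result is a `(task_type, score)` pair; the function name `summarize` and the sample scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize(task_scores):
    """Return (Mean (Task), Mean (TaskType), per-type means).

    task_scores: list of (task_type, score) pairs, one per task.
    """
    by_type = defaultdict(list)
    for task_type, score in task_scores:
        by_type[task_type].append(score)

    # Per-category means, e.g. the "Classification" column.
    type_means = {t: mean(scores) for t, scores in by_type.items()}

    # Mean (Task): every task counts equally.
    mean_task = mean(score for _, score in task_scores)

    # Mean (TaskType): every category counts equally, however many
    # tasks it contains.
    mean_task_type = mean(type_means.values())
    return mean_task, mean_task_type, type_means


# Hypothetical scores, not real leaderboard results.
scores = [
    ("Classification", 80.0),
    ("Classification", 70.0),
    ("Retrieval", 60.0),
    ("STS", 50.0),
]
mean_task, mean_task_type, type_means = summarize(scores)

# Category scores from the leaderboard row shown at the top of this page.
row_categories = [75.68, 35.26, 78.88, 57.89, 69.83]
row_mean_task_type = round(mean(row_categories), 2)
```

With the hypothetical scores, Mean (Task) is 65.0 while Mean (TaskType) is about 61.67, because the two classification tasks are folded into a single category before averaging. Applied to the leaderboard row at the top of this page, the five category scores average to 63.51, which matches that row's Mean (TaskType) value.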
Model Ranking:
- Primary ranking by Mean (Task) score
- Correlation metrics provide additional insights
- Task-specific performance shows model strengths
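As a rough illustration of the primary ranking rule, the sketch below sorts models by Mean (Task) in descending order and assigns 1-based ranks. The model names, scores, and the `rank_models` helper are hypothetical, and how the leaderboard breaks ties is not stated in the source.

```python
def rank_models(rows):
    """Sort rows by Mean (Task), highest first, and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda r: r["mean_task"], reverse=True)
    return [
        (rank, r["model"], r["mean_task"])
        for rank, r in enumerate(ordered, start=1)
    ]


# Hypothetical entries, not real leaderboard results.
rows = [
    {"model": "model-a", "mean_task": 61.2},
    {"model": "model-b", "mean_task": 69.4},
    {"model": "model-c", "mean_task": 65.0},
]
ranking = rank_models(rows)
```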
Quality Assurance:
- Standardized evaluation protocols
- Consistent preprocessing across tasks
- Multiple metrics per task for robustness
📊 Metrics Explanation:
- Mean (Task): Average performance across all individual tasks
- Mean (TaskType): Average performance by task categories
- Classification: Performance on Turkish classification tasks
- Clustering: Performance on Turkish clustering tasks
- Pair Classification: Performance on pair classification tasks (like NLI)
- Retrieval: Performance on information retrieval tasks
- STS: Performance on Semantic Textual Similarity tasks
- Correlation: Weighted average of correlation metrics for NLI and STSB datasets
- Parameters: Number of model parameters
- Embed Dim: Embedding dimension size
- Max Seq Length: Maximum sequence length the model can process (0 = infinite/unlimited)
- Vocab Size: Size of the model's vocabulary
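The Correlation column is described as a weighted average, but the weights themselves are not given here. The sketch below shows the general form only; the input values and the 3:1 weighting are assumptions made up for illustration, not the benchmark's actual settings.

```python
def weighted_average(values, weights):
    """Weighted mean of `values`; each weight sets how much its value counts."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical correlation scores for an NLI dataset and an STS dataset,
# with the first weighted three times as heavily as the second.
correlation = weighted_average([0.70, 0.50], [3, 1])  # approximately 0.65
```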
📖 About Mizan:
This leaderboard presents results from the Mizan benchmark, which evaluates embedding models on Turkish language tasks across multiple domains including:
- Text classification and sentiment analysis
- Information retrieval and search
- Semantic textual similarity
- Text clustering and pair classification
🚀 Submit Your Model:
Use the Submit tab to submit your Turkish embedding model for evaluation. Administrators will review your request, and you'll receive email notifications about its progress.
Contact:
For any questions or feedback, please contact info@newmind.ai
Links:
- GitHub: mteb/mteb v1.38.51. Mizan is currently based on MTEB v1.38.51; support for MTEB v2.0.0 is coming soon.