Leaderboard
Model rankings and performance comparison across all benchmark tasks.
| Rank | Model | Vendor | Overall Avg Score | Best Score | Runs |
|---|---|---|---|---|---|
| 1 | gpt-5.5 | | 73.8 | 78.9 | 3 |
| 2 | qwen3.6-plus | Alibaba | 70.6 | 77.6 | 3 |
| 3 | qwen3.6-flash | Alibaba | 70.2 | 78.7 | 3 |
| 4 | deepseek-v4-pro | DeepSeek | 65.6 | 81.1 | 3 |
| 5 | deepseek-v4-flash | DeepSeek | 60.5 | 76.0 | 3 |
| 6 | qwen3.5-397b-a17b | Alibaba | 57.0 | 72.4 | 3 |
| 7 | qwen3.6-27b | Alibaba | 54.5 | 66.7 | 3 |
| 8 | qwen3.5-flash | Alibaba | 43.4 | 59.1 | 3 |
| 9 | qwen3.5-27b | Alibaba | 31.6 | 50.5 | 3 |
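The gap between Best Score and Overall Avg Score gives a rough sense of how much a model's result varies across its 3 runs. The sketch below computes that gap from the values in the table above. The name `spread` is introduced here for illustration only and is not a metric the leaderboard reports.

```python
# (model, overall_avg, best) copied from the leaderboard table above.
rows = [
    ("gpt-5.5", 73.8, 78.9),
    ("qwen3.6-plus", 70.6, 77.6),
    ("qwen3.6-flash", 70.2, 78.7),
    ("deepseek-v4-pro", 65.6, 81.1),
    ("deepseek-v4-flash", 60.5, 76.0),
    ("qwen3.5-397b-a17b", 57.0, 72.4),
    ("qwen3.6-27b", 54.5, 66.7),
    ("qwen3.5-flash", 43.4, 59.1),
    ("qwen3.5-27b", 31.6, 50.5),
]

# Illustrative quantity (not an official metric): best minus average,
# rounded to one decimal to match the table's precision.
spread = {model: round(best - avg, 1) for model, avg, best in rows}

# Print models from largest to smallest gap.
for model, gap in sorted(spread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} {gap:5.1f}")
```

A large gap suggests that a single strong run sits well above the model's typical result. deepseek-v4-pro, for example, has the highest Best Score in the table (81.1) while ranking fourth on average.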
Domain Heatmap
Model performance across domains. Scores range from 0 to 100.
| Model | Calendar & Task Mgmt | Coding & Software Dev | Communication & Email | Deep Research & Report | DevOps & Env Repair | Documents & Knowledge | E-commerce & Daily Svcs | Finance & Data Analytics | Health & Fitness | Social Media |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5.5 | 68.7 | 66.3 | 62.4 | 67.2 | 75.7 | 82.5 | 76.6 | 98.7 | 75.9 | 56.7 |
| qwen3.6-plus | 60.2 | 72.0 | 73.5 | 60.2 | 78.5 | 78.5 | 76.3 | 95.4 | 58.1 | 42.4 |
| qwen3.6-flash | 67.8 | 52.9 | 73.0 | 60.6 | 75.1 | 76.1 | 74.3 | 93.3 | 67.2 | 53.5 |
| deepseek-v4-pro | 63.3 | 69.7 | 68.1 | 44.7 | 62.6 | 75.1 | 79.3 | 75.8 | 71.8 | 43.0 |
| deepseek-v4-flash | 38.3 | 71.3 | 56.4 | 54.2 | 62.8 | 77.1 | 71.5 | 80.0 | 44.8 | 33.5 |
| qwen3.5-397b-a17b | 45.6 | 62.2 | 58.4 | 53.0 | 60.4 | 68.6 | 64.3 | 69.0 | 33.2 | 44.1 |
| qwen3.6-27b | 33.8 | 48.4 | 69.1 | 36.4 | 59.6 | 89.5 | 68.7 | 67.4 | 27.1 | 30.6 |
| qwen3.5-flash | 16.0 | 42.7 | 7.3 | 56.6 | 60.8 | 70.2 | 44.5 | 64.5 | 23.3 | 16.4 |
| qwen3.5-27b | 15.3 | 68.6 | 41.3 | 25.4 | 27.0 | 53.8 | 32.8 | 23.7 | 21.2 | 14.7 |
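The overall ranking hides per-domain differences. As a minimal sketch, the following Python picks the top-scoring model in each domain from the heatmap values above. The variable names (`domains`, `scores`, `leaders`) are illustrative and not part of the leaderboard.

```python
# Column order matches the Domain Heatmap table above.
domains = [
    "Calendar & Task Mgmt", "Coding & Software Dev", "Communication & Email",
    "Deep Research & Report", "DevOps & Env Repair", "Documents & Knowledge",
    "E-commerce & Daily Svcs", "Finance & Data Analytics", "Health & Fitness",
    "Social Media",
]

# Per-model scores, copied row by row from the table.
scores = {
    "gpt-5.5":           [68.7, 66.3, 62.4, 67.2, 75.7, 82.5, 76.6, 98.7, 75.9, 56.7],
    "qwen3.6-plus":      [60.2, 72.0, 73.5, 60.2, 78.5, 78.5, 76.3, 95.4, 58.1, 42.4],
    "qwen3.6-flash":     [67.8, 52.9, 73.0, 60.6, 75.1, 76.1, 74.3, 93.3, 67.2, 53.5],
    "deepseek-v4-pro":   [63.3, 69.7, 68.1, 44.7, 62.6, 75.1, 79.3, 75.8, 71.8, 43.0],
    "deepseek-v4-flash": [38.3, 71.3, 56.4, 54.2, 62.8, 77.1, 71.5, 80.0, 44.8, 33.5],
    "qwen3.5-397b-a17b": [45.6, 62.2, 58.4, 53.0, 60.4, 68.6, 64.3, 69.0, 33.2, 44.1],
    "qwen3.6-27b":       [33.8, 48.4, 69.1, 36.4, 59.6, 89.5, 68.7, 67.4, 27.1, 30.6],
    "qwen3.5-flash":     [16.0, 42.7, 7.3, 56.6, 60.8, 70.2, 44.5, 64.5, 23.3, 16.4],
    "qwen3.5-27b":       [15.3, 68.6, 41.3, 25.4, 27.0, 53.8, 32.8, 23.7, 21.2, 14.7],
}

# For each domain, find the model with the highest score in that column.
leaders = {}
for i, domain in enumerate(domains):
    best_model = max(scores, key=lambda m: scores[m][i])
    leaders[domain] = (best_model, scores[best_model][i])

for domain, (model, score) in leaders.items():
    print(f"{domain:26s} {model:18s} {score}")
```

On these numbers the overall leader does not top every column: qwen3.6-27b, ranked seventh overall, has the highest Documents & Knowledge score (89.5).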
Factor Heatmap
Model performance across complexity factors. Scores range from 0 to 100.
| Model | A1 | A2 | B1 | B2 | C1 | C2 |
|---|---|---|---|---|---|---|
| gpt-5.5 | 70.5 | 69.9 | 63.8 | 68.9 | 78.6 | 55.0 |
| qwen3.6-plus | 67.8 | 71.8 | 59.5 | 68.3 | 64.3 | 55.0 |
| qwen3.6-flash | 66.7 | 68.0 | 57.9 | 68.7 | 73.8 | 66.1 |
| deepseek-v4-pro | 66.1 | 59.7 | 59.6 | 64.8 | 47.1 | 43.9 |
| deepseek-v4-flash | 54.7 | 58.7 | 53.2 | 66.6 | 33.3 | 64.4 |
| qwen3.5-397b-a17b | 54.1 | 55.0 | 46.0 | 61.5 | 52.4 | 64.4 |
| qwen3.6-27b | 51.4 | 51.9 | 49.0 | 76.2 | 19.0 | 38.9 |
| qwen3.5-flash | 39.6 | 49.9 | 37.7 | 56.3 | 23.8 | 22.2 |
| qwen3.5-27b | 25.3 | 33.2 | 27.2 | 49.7 | 15.7 | 38.9 |
Difficulty Breakdown
Model performance by difficulty level. Scores range from 0 to 100.
| Model | Easy | Medium | Hard |
|---|---|---|---|
| gpt-5.5 | 95.2 | 70.1 | 23.7 |
| qwen3.6-plus | 96.8 | 63.8 | 15.3 |
| qwen3.6-flash | 95.1 | 64.5 | 15.6 |
| deepseek-v4-pro | 87.6 | 61.4 | 15.0 |
| deepseek-v4-flash | 87.8 | 49.9 | 13.2 |
| qwen3.5-397b-a17b | 84.7 | 44.8 | 12.7 |
| qwen3.6-27b | 89.0 | 37.7 | 3.9 |
| qwen3.5-flash | 70.3 | 30.4 | 3.7 |
| qwen3.5-27b | 51.1 | 22.8 | 1.7 |
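Every model scores far lower on hard tasks than on easy ones. As a rough sketch, the Python below computes the easy-minus-hard drop from the figures above. `drop` is a name chosen for this example, not a leaderboard metric.

```python
# (easy, medium, hard) per model, copied from the Difficulty Breakdown above.
difficulty = {
    "gpt-5.5":           (95.2, 70.1, 23.7),
    "qwen3.6-plus":      (96.8, 63.8, 15.3),
    "qwen3.6-flash":     (95.1, 64.5, 15.6),
    "deepseek-v4-pro":   (87.6, 61.4, 15.0),
    "deepseek-v4-flash": (87.8, 49.9, 13.2),
    "qwen3.5-397b-a17b": (84.7, 44.8, 12.7),
    "qwen3.6-27b":       (89.0, 37.7, 3.9),
    "qwen3.5-flash":     (70.3, 30.4, 3.7),
    "qwen3.5-27b":       (51.1, 22.8, 1.7),
}

# Illustrative quantity: points lost between the easy and hard tiers.
drop = {m: round(easy - hard, 1) for m, (easy, _medium, hard) in difficulty.items()}

for model, lost in sorted(drop.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} {lost:5.1f}")
```

A small drop is not necessarily a good sign: a model that already scores low on easy tasks has less room to fall, so the drop is best read alongside the absolute hard score.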
Score scale: 0-100 · Metrics: overall, bestScore, easy, medium, hard, A1, A2, B1, B2, C1, C2 · Source: HuggingFace