Leaderboard

Model rankings and performance comparison across all benchmark tasks.

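| Rank | Model | Overall Avg Score | Best Score | Runs |
|------|-------|-------------------|------------|------|
| 1 | gpt-5.5 | 73.8 | 78.9 | 3 |
| 2 | qwen3.6-plus (Alibaba) | 70.6 | 77.6 | 3 |
| 3 | qwen3.6-flash (Alibaba) | 70.2 | 78.7 | 3 |
| 4 | deepseek-v4-pro (DeepSeek) | 65.6 | 81.1 | 3 |
| 5 | deepseek-v4-flash (DeepSeek) | 60.5 | 76.0 | 3 |
| 6 | qwen3.5-397b-a17b (Alibaba) | 57.0 | 72.4 | 3 |
| 7 | qwen3.6-27b (Alibaba) | 54.5 | 66.7 | 3 |
| 8 | qwen3.5-flash (Alibaba) | 43.4 | 59.1 | 3 |
| 9 | qwen3.5-27b (Alibaba) | 31.6 | 50.5 | 3 |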

Domain Heatmap

Model performance across domains. Scores range from 0 to 100.

| Model | Calendar & Task Mgmt | Coding & Software Dev | Communication & Email | Deep Research & Report | DevOps & Env Repair | Documents & Knowledge | E-commerce & Daily Svcs | Finance & Data Analytics | Health & Fitness | Social Media |
|-------|------|------|------|------|------|------|------|------|------|------|
| gpt-5.5 | 68.7 | 66.3 | 62.4 | 67.2 | 75.7 | 82.5 | 76.6 | 98.7 | 75.9 | 56.7 |
| qwen3.6-plus | 60.2 | 72.0 | 73.5 | 60.2 | 78.5 | 78.5 | 76.3 | 95.4 | 58.1 | 42.4 |
| qwen3.6-flash | 67.8 | 52.9 | 73.0 | 60.6 | 75.1 | 76.1 | 74.3 | 93.3 | 67.2 | 53.5 |
| deepseek-v4-pro | 63.3 | 69.7 | 68.1 | 44.7 | 62.6 | 75.1 | 79.3 | 75.8 | 71.8 | 43.0 |
| deepseek-v4-flash | 38.3 | 71.3 | 56.4 | 54.2 | 62.8 | 77.1 | 71.5 | 80.0 | 44.8 | 33.5 |
| qwen3.5-397b-a17b | 45.6 | 62.2 | 58.4 | 53.0 | 60.4 | 68.6 | 64.3 | 69.0 | 33.2 | 44.1 |
| qwen3.6-27b | 33.8 | 48.4 | 69.1 | 36.4 | 59.6 | 89.5 | 68.7 | 67.4 | 27.1 | 30.6 |
| qwen3.5-flash | 16.0 | 42.7 | 7.3 | 56.6 | 60.8 | 70.2 | 44.5 | 64.5 | 23.3 | 16.4 |
| qwen3.5-27b | 15.3 | 68.6 | 41.3 | 25.4 | 27.0 | 53.8 | 32.8 | 23.7 | 21.2 | 14.7 |
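
One way to read the heatmap is to ask, for each model, which domain is strongest and which is weakest. The Python sketch below does that for two rows copied from the table. The names `DOMAINS`, `SCORES`, `domain_extremes`, and `unweighted_mean` are illustrative and not part of the leaderboard. The unweighted mean is included only as a rough cross-check: how the Overall Avg Score is aggregated is not stated on this page, so the two figures need not match.

```python
# Domain names in the same column order as the heatmap.
DOMAINS = [
    "Calendar & Task Mgmt", "Coding & Software Dev", "Communication & Email",
    "Deep Research & Report", "DevOps & Env Repair", "Documents & Knowledge",
    "E-commerce & Daily Svcs", "Finance & Data Analytics", "Health & Fitness",
    "Social Media",
]

# Two rows copied from the heatmap; other models can be added the same way.
SCORES = {
    "gpt-5.5": [68.7, 66.3, 62.4, 67.2, 75.7, 82.5, 76.6, 98.7, 75.9, 56.7],
    "qwen3.5-27b": [15.3, 68.6, 41.3, 25.4, 27.0, 53.8, 32.8, 23.7, 21.2, 14.7],
}


def domain_extremes(row):
    """Return ((best_domain, score), (worst_domain, score)) for one row."""
    pairs = list(zip(DOMAINS, row))
    best = max(pairs, key=lambda p: p[1])
    worst = min(pairs, key=lambda p: p[1])
    return best, worst


def unweighted_mean(row):
    """Plain average over the ten domains (an assumption, not the site's formula)."""
    return round(sum(row) / len(row), 2)


if __name__ == "__main__":
    for model, row in SCORES.items():
        best, worst = domain_extremes(row)
        print(f"{model}: best={best[0]} ({best[1]}), "
              f"worst={worst[0]} ({worst[1]}), mean={unweighted_mean(row)}")
```

For gpt-5.5 this reports Finance & Data Analytics (98.7) as the strongest domain and Social Media (56.7) as the weakest, with an unweighted domain mean of about 73.1 against the listed overall average of 73.8.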

Factor Heatmap

Model performance across complexity factors. Scores range from 0 to 100.

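| Model | A1 | A2 | B1 | B2 | C1 | C2 |
|-------|------|------|------|------|------|------|
| gpt-5.5 | 70.5 | 69.9 | 63.8 | 68.9 | 78.6 | 55.0 |
| qwen3.6-plus | 67.8 | 71.8 | 59.5 | 68.3 | 64.3 | 55.0 |
| qwen3.6-flash | 66.7 | 68.0 | 57.9 | 68.7 | 73.8 | 66.1 |
| deepseek-v4-pro | 66.1 | 59.7 | 59.6 | 64.8 | 47.1 | 43.9 |
| deepseek-v4-flash | 54.7 | 58.7 | 53.2 | 66.6 | 33.3 | 64.4 |
| qwen3.5-397b-a17b | 54.1 | 55.0 | 46.0 | 61.5 | 52.4 | 64.4 |
| qwen3.6-27b | 51.4 | 51.9 | 49.0 | 76.2 | 19.0 | 38.9 |
| qwen3.5-flash | 39.6 | 49.9 | 37.7 | 56.3 | 23.8 | 22.2 |
| qwen3.5-27b | 25.3 | 33.2 | 27.2 | 49.7 | 15.7 | 38.9 |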

Difficulty Breakdown

Model performance by difficulty level. Scores range from 0 to 100.

| Model | Easy | Medium | Hard |
|-------|------|--------|------|
| gpt-5.5 | 95.2 | 70.1 | 23.7 |
| qwen3.6-plus | 96.8 | 63.8 | 15.3 |
| qwen3.6-flash | 95.1 | 64.5 | 15.6 |
| deepseek-v4-pro | 87.6 | 61.4 | 15.0 |
| deepseek-v4-flash | 87.8 | 49.9 | 13.2 |
| qwen3.5-397b-a17b | 84.7 | 44.8 | 12.7 |
| qwen3.6-27b | 89.0 | 37.7 | 3.9 |
| qwen3.5-flash | 70.3 | 30.4 | 3.7 |
| qwen3.5-27b | 51.1 | 22.8 | 1.7 |

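
The breakdown can also be read as a gap between tiers: how many points each model loses going from easy to hard tasks. The Python sketch below computes that gap from the figures listed above. `DIFFICULTY` and `easy_to_hard_drop` are illustrative names, and the simple difference is just one possible summary, not a metric defined by the leaderboard.

```python
# (easy, medium, hard) scores copied from the difficulty breakdown.
DIFFICULTY = {
    "gpt-5.5": (95.2, 70.1, 23.7),
    "qwen3.6-plus": (96.8, 63.8, 15.3),
    "qwen3.6-flash": (95.1, 64.5, 15.6),
    "deepseek-v4-pro": (87.6, 61.4, 15.0),
    "deepseek-v4-flash": (87.8, 49.9, 13.2),
    "qwen3.5-397b-a17b": (84.7, 44.8, 12.7),
    "qwen3.6-27b": (89.0, 37.7, 3.9),
    "qwen3.5-flash": (70.3, 30.4, 3.7),
    "qwen3.5-27b": (51.1, 22.8, 1.7),
}


def easy_to_hard_drop(model):
    """Points lost between the easy and hard tiers for one model."""
    easy, _medium, hard = DIFFICULTY[model]
    return round(easy - hard, 1)


if __name__ == "__main__":
    # List models from the largest easy-to-hard drop to the smallest.
    for model in sorted(DIFFICULTY, key=easy_to_hard_drop, reverse=True):
        print(f"{model}: drop of {easy_to_hard_drop(model)} points")
```

On these numbers every model loses at least 49 points between easy and hard; gpt-5.5 has the highest hard score (23.7) and still drops 71.5 points.
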
Score scale: 0-100 · Metrics: overall, bestScore, easy, medium, hard, A1, A2, B1, B2, C1, C2 · Source: HuggingFace