score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)

rank score age size name
1 62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
2 59.1 55 - GPT-5.5 (xhigh)
3 58.5 55 - GPT-5.5 (high)
4 57.2 104 - GPT-5.4 (xhigh)
5 56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
6 55.5 118 - Gemini 3.1 Pro Preview
7 53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
8 53.1 132 - GPT-5.3 Codex (xhigh)
9 52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
10 51.5 92 - GPT-5.4 mini (xhigh)
11 50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
12 50.7 1 large GLM-5.2 (max)
13 50.1 29 - Qwen3.7 Max
14 48.7 188 - GPT-5.2 (xhigh)
15 48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
16 47.8 205 - Claude Opus 4.5 (Reasoning)
17 47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
18 47.5 70 - Muse Spark
19 47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
20 47.1 58 large Kimi K2.6
21 47.1 29 - Gemini 3.5 Flash (minimal)
22 46.7 449 - Gemini 2.5 Pro Preview (Mar' 25)
23 46.5 211 - Gemini 3 Pro Preview (high)
24 46.5 16 - Qwen3.7 Plus
25 46.4 120 - Claude Sonnet 4.6 (Non-reasoning, High Effort)
26 45.6 5 large Kimi K2.7 Code
27 45.6 104 - GPT-5.4 (low)
28 45.5 56 large MiMo-V2.5-Pro
29 45.1 43 - GPT-5.5 Instant (May 2026)
30 45.0 29 - Gemini 3.5 Flash (high)
31 44.9 58 - Qwen3.6 Max Preview
32 44.7 216 - GPT-5.1 (high)
33 44.2 188 - GPT-5.2 (medium)
34 44.2 126 large GLM-5 (Reasoning)
35 43.9 92 - GPT-5.4 nano (xhigh)
36 43.4 71 large GLM-5.1 (Reasoning)
37 43.4 16 large MiniMax-M3
38 43.2 54 large DeepSeek V4 Pro (Reasoning, High Effort)
39 43.0 188 - GPT-5.2 Codex (xhigh)
40 42.9 76 - Qwen3.6 Plus
41 42.9 205 - Claude Opus 4.5 (Non-reasoning)
42 42.6 182 - Gemini 3 Flash Preview (Reasoning)
43 42.2 99 - Grok 4.20 0309 (Reasoning)
44 42.1 56 large MiMo-V2.5
45 41.9 91 large MiniMax-M2.7
46 41.4 91 - MiMo-V2-Pro
47 41.3 121 large Qwen3.5 397B A17B (Reasoning)
48 41.0 48 - Grok 4.3 (high)
49 40.5 71 - Grok 4.20 0309 v2 (Reasoning)
50 40.5 342 - Grok 4
51 39.8 54 large DeepSeek V4 Flash (Reasoning, High Effort)
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
You left some models out, DeepSeek and Kimi for example.
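For anyone who wants to redo the curation on a fresh copy of the leaderboard, here is a rough sketch of the rule described for the longer list: keep at most the two highest-scoring entries per model, and stop at a chosen cutoff entry. It is not the poster's actual script. The function names `base_model` and `curate` are made up for illustration, and treating the trailing parenthesised part of a name as the "variant" is an assumption about how entries are grouped.

```python
from collections import defaultdict


def base_model(name: str) -> str:
    # Assumption: the variant (effort level, reasoning mode) is the trailing
    # parenthesised part, so "GPT-5.5 (xhigh)" and "GPT-5.5 (high)" share a base.
    return name.split(" (")[0].strip()


def curate(entries, per_model=2, cutoff_name=None):
    """Keep the top `per_model` entries per base model, stopping at the cutoff."""
    kept = []
    seen = defaultdict(int)
    # sorted() is stable, so entries with equal scores keep their input order.
    for score, name in sorted(entries, key=lambda e: e[0], reverse=True):
        base = base_model(name)
        if seen[base] < per_model:
            seen[base] += 1
            kept.append((score, name))
        # Stop once the cutoff entry has been reached, whether or not it was kept.
        if name == cutoff_name:
            break
    return kept


# A few (score, name) rows copied from the lists in this thread.
sample = [
    (59.1, "GPT-5.5 (xhigh)"),
    (58.5, "GPT-5.5 (high)"),
    (56.2, "GPT-5.5 (medium)"),
    (55.5, "Gemini 3.1 Pro Preview"),
    (52.1, "GPT-5.5 (low)"),
    (39.8, "DeepSeek V4 Flash (Reasoning, High Effort)"),
]

curated = curate(sample, cutoff_name="DeepSeek V4 Flash (Reasoning, High Effort)")
for rank, (score, name) in enumerate(curated, start=1):
    print(rank, score, name)
```

On this sample the medium and low GPT-5.5 rows are dropped because two GPT-5.5 entries already rank above them. Names that differ outside the parentheses (for example "GPT-5.4" and "GPT-5.4 mini") count as separate models under this grouping.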
Short comments...
- GPT-5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic marketeer strike force...
- China is going to eat the US's lunch on AI
- What have European universities and companies been doing? It's as if, in some parallel past/future, Nikola Tesla and Edison had built flying cyberpunk machines while European researchers were getting together to request EU funds for research on how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?
Lol thank you for sorting.
Are the scores here normalized such that each point difference is equidistant?