Geometric mean of MMMLU + GPQA-Diamond + SimpleQA + LiveCodeBench:
- Gemini 3.0 Pro: 84.8
- DeepSeek 3.2: 83.6
- GPT-5.1: 69.2
- Claude Opus 4.5: 67.4
- Kimi-K2 (1.2T): 42.0
- Mistral Large 3 (675B): 41.9
- DeepSeek-3.1 (670B): 39.7
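For anyone wanting to reproduce the aggregate number from individual benchmark results, a geometric mean is just the nth root of the product of the n scores. A minimal sketch (the per-benchmark scores here are made-up placeholders, not the actual breakdowns behind the list above):

```python
import math

def geometric_mean(scores):
    # nth root of the product of n scores.
    # Unlike the arithmetic mean, one weak benchmark
    # drags the whole aggregate down sharply.
    return math.prod(scores) ** (1 / len(scores))

# Hypothetical per-benchmark scores in the order
# MMMLU, GPQA-Diamond, SimpleQA, LiveCodeBench:
example = [91.0, 83.0, 78.0, 86.0]
print(geometric_mean(example))
```

This is also why geometric-mean leaderboards punish models with one lopsided weakness (e.g. a low SimpleQA score) more than a simple average would.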
The 14B, 8B, and 3B models are SOTA though, and don't have Chinese censorship the way Qwen3 does.
How is there such a gap between Gemini 3 and GPT-5.1/Opus 4.5? Which benchmark is Gemini 3 crushing the others on?