Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:
| Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
|----------------|-------|-------|-------|------|-------|-------|-------|-------|
| G4 31B | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
| G4 26B A4B | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% | 8.7% | 17.2% |
| G4 E4B | 69.4% | 58.6% | 52.0% | 940 | 42.2% | 76.6% | - | - |
| G4 E2B | 60.0% | 43.4% | 44.0% | 633 | 24.5% | 67.4% | - | - |
| G3 27B no-T | 67.6% | 42.4% | 29.1% | 110 | 16.2% | 70.7% | - | - |
| GPT-5-mini | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
| GPT-OSS-120B | 80.8% | 80.1% | 82.7% | 2157 | - | 78.2% | 14.9% | 19.0% |
| Q3-235B-A22B | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% | - |
| Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
| Q3.5-27B | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
| Q3.5-35B-A3B | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |
MMLUP: MMLU-Pro
GPQA: GPQA Diamond
LCB: LiveCodeBench v6
ELO: Codeforces ELO
TAU2: TAU2-Bench
MMMLU: Multilingual MMLU
HLE-n: Humanity's Last Exam (no tools / CoT)
HLE-t: Humanity's Last Exam (with search / tool)
no-T: no thinking

So is there something I can take from that table if I have a 24 GB video card? I'm honestly not sure how to use those numbers.
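Rough rule of thumb: at 4-bit quantization a dense model needs roughly params × 0.6 GiB for the weights, plus headroom for KV cache and runtime buffers. For the MoE models (the A-suffixed ones), the "A" number is active parameters per token, which mostly determines speed; the full parameter count still has to fit in memory. Here's a minimal back-of-the-envelope sketch, assuming ~4.8 bits/weight for a Q4_K_M-style quant and a flat ~2.5 GiB overhead (both illustrative assumptions, not figures from the model cards):

```python
# Rough VRAM estimate for a quantized GGUF model: weights + KV cache + buffers.
# All constants below are illustrative assumptions, not from the model cards.

def gguf_weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of the quantized weights in GiB."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# (name, total params in billions, effective quant bits)
models = [
    ("Q3.5-27B @ Q4_K_M",     27, 4.8),  # ~4.8 bits/weight is typical for Q4_K_M
    ("Q3.5-27B @ Q8_0",       27, 8.5),
    ("G4 31B @ Q4_K_M",       31, 4.8),
    ("Q3.5-35B-A3B @ Q4_K_M", 35, 4.8),  # MoE: all experts still load into memory
]

VRAM_GIB = 24
OVERHEAD_GIB = 2.5  # rough allowance for KV cache + buffers at modest context

for name, params_b, bits in models:
    need = gguf_weight_gib(params_b, bits) + OVERHEAD_GIB
    verdict = "fits" if need <= VRAM_GIB else "needs offload"
    print(f"{name:26s} ~{need:5.1f} GiB -> {verdict} on a {VRAM_GIB} GB card")
```

By that math the dense Q3.5-27B and the G4 31B both squeeze onto a 24 GB card at 4-bit (with room for only a modest context), Q8 does not, and MoE models tend to degrade more gracefully under CPU offload since only the active experts are computed per token.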
Wild differences in ELO compared to tfa's graph: https://storage.googleapis.com/gdm-deepmind-com-prod-public/...
(Comparing Q3.5-27B to G4 26B A4B and G4 31B specifically)
I'd assume Q3.5-35B-A3B would perform worse than the dense Q3.5-27B model, but the cards pasted above somehow show that for ELO and TAU2 it's the other way around...
Very impressed by the Unsloth team releasing the GGUFs so quickly. If it's anything like Qwen 3.5, I'll wait a few more days in case they push a major update.
Overall great news if it's at parity with or slightly better than the Qwen 3.5 open weights; I hope to see both of these evolve in the sub-32GB-RAM space. Disappointed in Mistral/Ministral being so far behind these US & Chinese models.