Hacker News

jasonjmcghee, yesterday at 8:17 PM

Interesting selection of models for the "instruction count vs. accuracy" plot. I'm curious when that was done and why they chose those models. How well does this generation of models do: ChatGPT 5/5.1 (and the Codex/mini/nano variants), Gemini 3, Claude Haiku/Sonnet/Opus 4.5, recent Grok models, Kimi K2 Thinking, etc.?


Replies

alansaber, yesterday at 8:25 PM

I'm guessing they included some smaller models just to show how their accuracy drops off at smaller context sizes.
