Hacker News

potatoman22, today at 4:47 AM

Not to be nitpicky, but many of the 4-12b models are somewhere between GPT-3.5 and GPT-4o-mini. It's hard to find a good comparison though, because the benchmarks people score models against change so often. For reference, Sonnet 3.6 came out about a year after GPT-3.5.


Replies

nl, today at 5:25 AM

Don't worry about being nitpicky! I'm going to out-nitpick you....

Actually....

I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark that isn't in the training data yet, and I've found it can separate frontier models from close followers (the only models to get 100% are Opus 4.6 and GPT 5.5).

The best small model I've found is a fine-tune of Opus-3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...

Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.