Another possible conclusion is that your definition of good, whatever that may be, doesn’t include the benchmarks these models are targeting.
I don’t actually know what they all are, but MATH-500, for instance, is a math problem-solving benchmark that Sonnet isn’t all that good at.
The benchmarks target specific weaknesses that LLMs generally have from learning only next-token prediction and instruction tuning. In fact, benchmarks show there are large gaps in some areas, like math, where even top models don’t perform well.
‘According to these benchmarks’ is key, but not for the reasons you’re expressing.
3) It’s key because that’s the hole they’re trying to fill. Realistically, most people aren’t using models to solve algebra problems in personal usage, so performance on that benchmark isn’t very visible unless you’re using an LLM for exactly that.
If you look at a larger suite of benchmarks, I would expect them to underperform compared to Sonnet. It’s no different than sports stats: you can say who is best at one specific part of the game (rebounds, three-point shots, etc.), and you have a general sense of who is best overall (e.g., LeBron, Jordan). But the best players aren’t the best at everything, and it’s hard to argue who is the ‘best of the best’ because that depends on what weight you give to the individual benchmarks they’re good at. And then you also have a lot of players who are only good at one thing.
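To make the weighting point concrete, here’s a minimal sketch with made-up scores (the model names and numbers are hypothetical, not real benchmark results): the same two models can trade places as the “best” depending purely on how you weight the categories.

```python
# Toy illustration: hypothetical scores, not real benchmark numbers.
scores = {
    "model_a": {"math": 95, "coding": 80, "writing": 70},
    "model_b": {"math": 70, "coding": 85, "writing": 90},
}

def weighted_score(model, weights):
    # Weighted sum of a model's per-category scores.
    return sum(scores[model][cat] * w for cat, w in weights.items())

# Two different opinions about what matters:
math_heavy = {"math": 0.6, "coding": 0.2, "writing": 0.2}
balanced   = {"math": 0.2, "coding": 0.3, "writing": 0.5}

for weights in (math_heavy, balanced):
    best = max(scores, key=lambda m: weighted_score(m, weights))
    print(weights, "->", best)
```

With the math-heavy weights, model_a comes out on top; with the balanced weights, model_b does. Neither ranking is wrong; they just encode different priorities, which is the whole problem with arguing about a single “best” model.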