These are very good numbers. I still don’t get why they don’t compare against latest competitor versions in these posts, it’s not like we’re all not going to notice.
I think the argument is that they're trying to suggest that they're close to N months from SOTA.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
I think it's part of the expectation setting (with a side of "we ran our distillation/eval harness on a specific model").
If they say it's comparable to 4.7, it anchors that in your head as the model to evaluate against.
Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.
this puzzles me too, I want to know
Marketing.
I find it forgivable if it's within a minor version bump. (NB that x.5 is now a de facto major-version bump for LLMs, for whatever reason.)
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.