Hacker News

goyozi today at 11:10 AM · 7 replies

These are very good numbers. I still don’t get why they don’t compare against the latest competitor versions in these posts; it’s not as if we won’t notice.


Replies

NiloCK today at 1:14 PM

I find it forgivable if it's within a minor version bump. (NB that x.5 is now a de facto major-version bump for LLMs, for whatever reason.)

Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.

Aurornis today at 12:57 PM

I think the argument is that they’re trying to suggest that they’re close to N months from SOTA.

Realistically I assume they hope readers don’t notice the fine details.

The Qwen models are great for open weights, but in my experience no past release has performed as well as its benchmarks suggest. They’re optimizing for benchmark numbers because they know it works.

htrp today at 12:32 PM

I think it's part of the expectation setting (with a side of "we did our distillation/eval harness on a specific model").

If they say it's comparable to 4.7, that anchors 4.7 in your head as the model to evaluate against.

beydogan today at 1:12 PM

Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.

hmokiguess today at 12:15 PM

This puzzles me too; I want to know.

maelito today at 12:36 PM

Marketing.

pulse-dev today at 1:15 PM

[dead]