goyozi • today at 11:10 AM • 7 replies

These are very good numbers. I still don’t get why they don’t compare against the latest competitor versions in these posts; it’s not like we aren’t all going to notice.

Replies

NiloCK • today at 1:14 PM

I find it forgivable if it's within a minor version bump. (NB that x.5 is now a de facto major-version bump for LLMs for whatever reason.)

Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.

Aurornis • today at 12:57 PM

I think the point is to suggest that they’re within N months of SOTA.

Realistically I assume they hope readers don’t notice the fine details.

The Qwen models are great for open weights, but in my experience no past release has performed as well as its benchmarks suggested. They’re optimizing for benchmark numbers because they know it works.

htrp • today at 12:32 PM

I think it's part of the expectation setting (with a side of "we did our distillation/eval harness on a specific model").

If they say it's 4.7 comparable, it anchors that into your head as the model to evaluate against.

beydogan • today at 1:12 PM

Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.

hmokiguess • today at 12:15 PM

This puzzles me too; I want to know.

maelito • today at 12:36 PM

Marketing.

pulse-dev • today at 1:15 PM

[dead]
