logoalt Hacker News

aspenmartinlast Tuesday at 5:33 PM6 repliesview on HN

Yes but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6 there is an insane amount of hard data that contradicts this.


Replies

computerexlast Tuesday at 5:44 PM

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

show 1 reply
gen220last Tuesday at 5:57 PM

Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.

show 1 reply
recitedropperlast Tuesday at 6:06 PM

"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

show 1 reply
pythonaut_16last Tuesday at 7:10 PM

Seems like a bunch of noise. What does this even mean?

It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

show 1 reply
OtomotOlast Tuesday at 7:20 PM

There is no data that I would trust that contradicts it.

Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

Claude was heavily lobotomised for my work starting somewhen in February.

I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

I quit my abo in March and talked to said friends who are still on a plan just last week: they are still not happy, but company pays so whatever...

show 1 reply
orbifoldlast Tuesday at 6:39 PM

[dead]