Hacker News

verdverm yesterday at 6:28 PM

We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as about making a model that works well day-to-day.


Replies

stego-tech yesterday at 7:01 PM

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as fact rather than flagged as suspect or “I don’t know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

quantumHazer yesterday at 6:34 PM

Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.

brokensegue yesterday at 6:54 PM

how do you quantitatively measure day-to-day quality? the only thing I can think of is A/B tests, which take a while to evaluate
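[A/B-testing day-to-day quality can be quantified once you pick a binary outcome, e.g. "did the model complete the task without intervention?" A minimal sketch with a two-proportion z-test; the success counts and variant names below are purely illustrative, not from this thread.]

```python
# Hypothetical A/B test: counts of successfully completed tasks for two
# model variants. All numbers here are made up for illustration.
from math import sqrt, erf

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """One-sided two-proportion z-test: is variant B's success rate higher?"""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis (no difference)
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value via the standard normal CDF
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# 48% vs 52% success over 1000 trials each
z, p = two_proportion_z(480, 1000, 520, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

[Even a 4-point gap over 1000 trials per arm only barely clears p < 0.05 here, which illustrates the commenter's point: reliable day-to-day signal takes a lot of samples and therefore a while to collect.]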

Mistletoe yesterday at 6:43 PM

How do you measure whether it works better day to day without benchmarks?

HDThoreaun yesterday at 6:56 PM

ARC-AGI is just an IQ test. I don’t see the problem with training a model to be good at IQ tests, because that’s a skill that translates well.
