We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well day-to-day.
Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.
How do you quantitatively measure day-to-day quality? The only thing I can think of is A/B tests, which take a while to evaluate.
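To make the "takes a while" point concrete, here's a minimal sketch (not anything any lab actually does, and the counts are invented) of an A/B comparison on a thumbs-up/thumbs-down signal. A small quality lift only becomes statistically distinguishable once each arm has a lot of sessions:

```python
# Hypothetical A/B comparison of two model variants on thumbs-up rate,
# using a two-proportion z-test. All numbers below are made up.
from math import sqrt
from statistics import NormalDist

def ab_significance(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two thumbs-up rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return p_a, p_b, 2 * (1 - NormalDist().cdf(abs(z)))

# A 1-point lift over a 70% baseline: 5,000 sessions per arm is inconclusive...
print(ab_significance(3550, 5000, 3500, 5000))      # p ≈ 0.27
# ...but 50,000 sessions per arm clearly resolves it.
print(ab_significance(35500, 50000, 35000, 50000))  # p < 0.001
```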
How do you measure whether it works better day to day without benchmarks?
ARC-AGI is just an IQ test. I don't see the problem with training a model to be good at IQ tests, because that's a skill that translates well.
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?
Thus far they all fail. Code outputs don't run, or variables aren't captured correctly, or hallucinations are presented as fact rather than flagged as uncertain or "I don't know."
It's 2000s PC gaming all over again ("gotta game the benchmark!").