We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well day-to-day.
Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.
How do you quantitatively measure day-to-day quality? The only thing I can think of is A/B tests, which take a while to evaluate.
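To make the "takes a while" point concrete, here's a minimal sketch (not anything any lab actually does, and the counts are invented) of an A/B comparison on a thumbs-up/thumbs-down signal. A small quality lift only becomes statistically distinguishable once each arm has a lot of sessions:

```python
# Hypothetical A/B comparison of two model variants on thumbs-up rate,
# using a two-proportion z-test. All numbers below are made up.
from math import sqrt
from statistics import NormalDist

def ab_significance(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two thumbs-up rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return p_a, p_b, 2 * (1 - NormalDist().cdf(abs(z)))

# A 1-point lift over a 70% baseline: 5,000 sessions per arm is inconclusive...
print(ab_significance(3550, 5000, 3500, 5000))      # p ≈ 0.27
# ...but 50,000 sessions per arm clearly resolves it.
print(ab_significance(35500, 50000, 35000, 50000))  # p < 0.001
```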
How do you measure whether it works better day to day without benchmarks?
ARC-AGI is just an IQ test. I don't see the problem with training a model to be good at IQ tests, because that's a skill that translates well.
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?
Thus far they all fail. Code outputs don't run, or variables aren't captured correctly, or hallucinations are presented as fact rather than flagged as uncertain or "I don't know."
It's 2000s PC gaming all over again ("gotta game the benchmark!").