Hacker News

verdverm · yesterday at 3:08 PM

Some of the benchmarks went down, has that happened before?


Replies

andy12_ · yesterday at 3:24 PM

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab has published an incremental update of a model that is worse on some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results on LiveCodeBench but scored lower on most other benchmarks.

https://news.ycombinator.com/item?id=43906555

grandinquistor · yesterday at 3:14 PM

Probably deprioritizing other areas to focus on SWE capabilities, since I reckon most of their revenue comes from enterprise coding usage.

ACCount37 · yesterday at 3:19 PM

Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.

grandinquistor · yesterday at 4:05 PM

Looking at the system card for Opus 4.7, the MCRC benchmark used for long-context tasks dropped significantly, from 78% to 32%.

I wonder what caused such a large regression on this benchmark.