Hacker News

onlyrealcuzzo, today at 4:55 PM (5 replies)

Does anyone troll these releases and cherry-pick the random metrics that other companies would cherry-pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model picks 5-10 where it wins in everything except 1, to make it look like they aren't cherry-picking benchmarks they probably benchmaxxed for.


Replies

aronowb14, today at 5:08 PM

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker. I'm not sure of their exact methodology, but during day-to-day programming with Claude / GPT models I’ve felt qualitatively what they report.

nerevarthelame, today at 5:14 PM

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, and SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

ddosmax556, today at 5:52 PM

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!

I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.

bel8, today at 5:16 PM

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

YetAnotherNick, today at 5:07 PM

At least they show competitors in their benchmarks, unlike OpenAI, which likes to pretend that there aren't any competitors.