Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.
Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!
I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.
On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?
At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.
https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report