Hacker News

nerevarthelame today at 5:14 PM

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metrics they reported for 4.7, they excluded these for 4.8: BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, and SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.


Replies

onlyrealcuzzo today at 5:15 PM

Gonna assume it's because they barely budged or moved downward, and most of their reported benchmark results are probably within sampling error...
