Hacker News

MattRix, yesterday at 3:41 PM

The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (e.g., because a test expects the solution function to have a specific name that wasn’t specified in the task itself).

So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
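
To put a rough number on how much repeated runs could help, here is a minimal sketch. It assumes each run passes a given flaky task independently with some per-run probability `p`; the independence, the function name, and the example values are all my own assumptions for illustration, not anything Anthropic has described.

```python
def best_of_n_pass_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent runs passes a task.

    p is the assumed per-run pass probability for that task (hypothetical),
    n is the number of repeated runs.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The task is missed only if every one of the n runs fails.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical example: a task that passes only 1% of the time per run.
    for runs in (1, 10, 100, 10000):
        print(runs, best_of_n_pass_probability(0.01, runs))
```

Note this is the per-task, pick-the-best-attempt version. Taking the single highest-scoring full run out of many would gain less, since one run has to get lucky on many tasks at once.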


Replies

gpm, yesterday at 3:52 PM

We actually know that a "100% pass rate" is trivially possible: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

But if that hypothesis is the explanation, the interesting part is that Opus 4.7 (but not 4.6) seems to be doing the same.
