The problem isn’t that the tasks are impossible to solve; it’s that they’re underspecified and/or impossible to solve consistently (e.g., because a test expects the solution function to have a specific name that wasn’t specified in the task itself).
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
We actually know that a "100% pass rate" is trivially possible by gaming the benchmark: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation, the interesting part is that Opus 4.7 (but not 4.6) seems to be doing the same.