Mythos 5.5 SWE-bench Pro 77.8%* 58.6% ...

6thbit • yesterday at 7:19 PM • 4 replies • view on HN

                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench but quite comparable otherwise. Source for mythos values: https://www.anthropic.com/glasswing

Replies

aliljet • yesterday at 7:47 PM

Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed the Opus autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe..

XCSme • yesterday at 7:51 PM

They mentioned in their release page, that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.

Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...

kaonashi-tyc-01 • yesterday at 8:49 PM

I did some study on Verified, not Pro, but Mythos number there rings a lot of questions on my end.

If you look at the SWEBench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submission across 500 problems, what I found that the aggregated resolution rate is 93% (sharp).

Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, then I became even more suspicious, for the remaining 7% problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because how drastically the solution itself deviates from the problem statement, it almost feels like it is trying to solve a different problem.

Not that I am saying Mythos is cheating, but it might be too capable to remember all states of said repos, that it is able to reverse engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomena of evaluation awareness. Otherwise I genuinely couldn't think of exactly how it could be this precise in deciphering such unspecific problem statements.

➕ show 1 reply

alansaber • yesterday at 8:27 PM

A single benchmark is meaningless, you always get quirky results on some benchmarks.

alt Hacker News

Replies