Hacker News

yfontana, yesterday at 9:09 PM

OpenAI wrote a couple of months ago that they no longer consider SWE-bench Verified a meaningful benchmark (and they were the ones who published it in the first place): https://openai.com/index/why-we-no-longer-evaluate-swe-bench...


Replies

kaonashi-tyc-01, yesterday at 9:14 PM

Yep, I read that blog post. What confuses me is that Anthropic doesn't seem to be bothered by the study and keeps publishing SWE-bench Verified results.

That is what got me curious in the first place. The fact that Mythos scored so high, IMO, exposes an issue with this model: it is able to solve seemingly unsolvable problems.

Setting aside any allegation of cheating, which I don't think ANT is doing, it would have to be doing some fortune-telling or future-reading to score that high at all.