I'm a co-creator of SWE-bench:
1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open-source in the next month) are still unsaturated.
3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/. And we'll have more to say soon :)
> 93.9% (congrats Anthropic)
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 ≈ 0.113, which is greater than 1 - 0.939 ≈ 0.061.
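A quick back-of-the-envelope check of that inequality (a sketch in Python; the variable names are mine, the 19.1% and 59.4% figures come from the quoted article, and the 93.9% score is taken at face value):

    # Share of all problems that were audited (per the quoted article)
    audited_fraction = 0.191
    # Share of audited problems whose tests reject functionally correct patches
    flawed_fraction = 0.594

    # Lower bound on flawed problems as a share of the whole benchmark
    flawed_overall = audited_fraction * flawed_fraction  # ~0.113

    # Failure rate implied by a 93.9% reported score
    reported_failure_rate = 1 - 0.939  # ~0.061

    print(flawed_overall, reported_failure_rate, flawed_overall > reported_failure_rate)

If both percentages hold, any score above roughly 88.7% has to be passing at least some of the problems the audit flags as having flawed tests.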
Does this mean that the audited subset wasn't representative? Or that Anthropic is achieving its high score through some shady means?
> 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.
Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.
Those who fail to study history (or live through it) are doomed to repeat it.
SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.
I don't have a solution; I'm just noticing the pattern.