Hacker News

SWE-bench Verified no longer measures frontier coding capabilities

200 points by kmdupree | today at 1:58 PM | 119 comments

Comments

ofirpress | today at 6:32 PM

I'm a co-creator of SWE-bench:

1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.

3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)

Jcampuzano2 | today at 3:22 PM

It's pretty clear that any benchmark that comes out will be outdated and end up in the training data in short order. There will always be an incentive to optimize specifically for these benchmarks, even if just for marketing material. Sure, there is a training cutoff, but it's usually only 3-6 months behind the public release dates.

The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed to not already be in the training data, and not borrow anything from previous benchmarks.

In this regard, I don't think any benchmark created before a given model's release should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly just stop including benchmarks in marketing material altogether.

Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.

cpard | today at 5:43 PM

Benchmarks/evals are really hard, and they become harder when there's a huge incentive to game them at industry scale.

ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago.

A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team found that the benchmark has structural issues that biased results.

Here’s the paper: https://arxiv.org/abs/2603.29399

None of this is new, though; the industry has gone through all of it before, just at a smaller scale, and there's a lot to learn from that. Here's a post I wrote on the parallels between what we see today and the benchmarketing wars of the database systems era.

https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...

threepts | today at 3:31 PM

Why don't they ask their premier model to generate a bench for them?

Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.

Leaderboard: https://arcprize.org/leaderboard

(Most premier models don't even pass 5 percent.)

kqr | today at 3:58 PM

It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests.

https://entropicthoughts.com/no-swe-bench-improvement

rustyhancock | today at 4:28 PM

I think an Olympiad format is better. But the financial incentive is such that it might be near impossible to stop leaks.

I.e., a panel comes up with a series of problems.

Like Advent of Code or Project Euler, but more complex and constrained.

Benchmark outcomes could be performance points plus measures of cost and time to solution (well, token count really).

It would be run a couple of times per year.

It avoids overfitting.

Over time, the tasks can become more complex if needed.

If they benchmax it into being able to complete full products from spec, with robust implementations, amazing.

marlburrow | today at 7:12 PM

The "private benchmarks" suggestion comes up every time, but I think there's a more interesting axis: benchmarks built on top of already-public, already-stable test instruments. SWE-bench is fundamentally a corpus that lives on GitHub — once it ships, it leaks into training data automatically. Benchmarks built on contested qualitative instruments (psych tests, opinion surveys) have a different contamination profile because the correct answer doesn't exist in the training corpus to memorize — only the question does.

That doesn't help for measuring coding ability specifically (you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public", it's "public + has a fixed correct answer".

vintagedave | today at 3:13 PM

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.

Is this saying a quarter* of the questions and answers were wrong, this whole time?!

If so, how was this ever, in any way, a valid measurement?

And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.

[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
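For reference, the corrected figure follows directly from multiplying the two quoted percentages; it is a lower bound on the full benchmark, since only the failure-prone subset was audited:

```python
# Figures from the quoted audit: 27.6% of tasks were audited, and at least
# 59.4% of those audited tasks had flawed test cases.
audited_fraction = 0.276
flawed_within_audited = 0.594

# Share of the *full* benchmark known to be flawed (a lower bound):
flawed_overall = audited_fraction * flawed_within_audited
print(f"{flawed_overall:.1%}")  # ~16.4%, i.e. roughly one in six, not one in four
```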

parentheses | today at 5:34 PM

The timing makes me wonder if this is a direct response to Deepseek V4 having performance comparable to SOTA models.

gertlabs | today at 3:45 PM

A better benchmark needs to be objectively scored, have multi-disciplinary breadth, and be scalable (no single correct answer).

That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.

1a527dd5 | today at 3:02 PM

This feels very much like "we are now moving the goal posts".

axpy906 | today at 8:11 PM

Once the bench is public it’s out and probably in the training data. Better to have your own and test it on a new model.

languid-photic | today at 5:30 PM

It’s very hard to encode the properties that matter most in code in tests. [1]

[1] https://voratiq.com/blog/your-workflow-is-the-eval

ripvanwinkle | today at 3:30 PM

>>In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim problem statement specifics for certain tasks, indicating that all of them have seen at least some of the problems and solutions during training

this statement alone seems to invalidate the SWE-bench tests
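One way to make the quoted observation concrete is to check a model's patch for near-verbatim similarity to the gold patch. This is only an illustrative sketch, not the article's actual methodology; it uses the stdlib's `difflib`, and both the whitespace normalization and the 0.95 threshold are arbitrary choices:

```python
import difflib

def normalize(patch: str) -> str:
    # Collapse all whitespace so trivial reformatting doesn't hide a match.
    return " ".join(patch.split())

def verbatim_suspect(model_patch: str, gold_patch: str,
                     threshold: float = 0.95) -> bool:
    """Flag a patch that is near-identical to the gold patch: one possible
    signal (not proof) that the solution was seen during training."""
    ratio = difflib.SequenceMatcher(
        None, normalize(model_patch), normalize(gold_patch)
    ).ratio()
    return ratio >= threshold
```

A real contamination audit would also have to look for weaker signals, such as the "verbatim problem statement specifics" the quote mentions, which pure patch similarity misses.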

lmeyerov | today at 6:18 PM

It's been fun benchmarking AI investigations at botsbench.com . Part of it is checking for these kinds of issues - we recently started seeing contamination in our first generation challenge, and less obvious, agent sandbox escapes for other kinds of cheating. Fun times!

eugenekolo | today at 6:32 PM

Without SWE-Bench though, how will AI models properly game their results to show ~5-10% gain each iteration?

Once a benchmark is known and there are billions of dollars on the line, obviously every company will game it.

djoldman | today at 3:17 PM

> We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues.

https://arxiv.org/pdf/2509.16941

swyx | today at 6:26 PM

More context in a small writeup, plus we interviewed the team behind this when it was announced: https://www.latent.space/p/swe-bench-dead

wredcoll | today at 5:43 PM

This is somewhat tangential, but I want a model that can detect physical objects placed on top of a board from a picture/video, specifically warhammer 40k models.

I want a model that can detect the actual units/models placed on top of the terrain/board so I can track how they move during the game, but Gemini and ChatGPT were absolutely rubbish when I tried them.

Jimmc414 | today at 3:23 PM

Goodhart's Law in reverse: what can't be gamed gets rejected.

cowartc | today at 4:32 PM

The headline leads with contamination, but buried in there is that 59.4% of audited failures had test design defects. That's a measurement system that was never validated against ground truth before being adopted industry-wide as a score that mattered. The industry reported on it for two years, but the gauge was broken the entire time.

w4yai | today at 2:58 PM

I don't understand these websites which force translation to my native language.

I mean, it's fine as it's useful for many people, but where is the button for disabling it? Or why is it enabled by default?

"codage de pointe" sounds so weird and cringe in French.

gpm | today at 3:25 PM

Curiously, Opus 4.7 claims an 87.6% pass rate and Mythos claims a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.

adityamwagh | today at 3:16 PM

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

No shit, Sherlock!

neuroelectron | today at 4:42 PM

It's really naïve to think any of the big AI companies won't cheat.

DeathArrow | today at 3:51 PM

So we need to generate benchmarks after the models finish training. Or we need to keep the solutions to the benchmark problems as closed source.

retinaros | today at 3:35 PM

it never did

DeathArrow | today at 4:23 PM

So Opus 4.7 and Mythos are solving problems that are impossible to solve?

varispeed | today at 3:23 PM

Another issue with these benchmarks is that they measure a model you are unlikely to actually be routed to. My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance matches a low-B-parameter Qwen. I think there should be a way to verify which model is actually processing your prompts, and it should be independently verified.

At the moment it is so bad that you have to ask the model a verification question in the form of a non-trivial problem. If it solves it, there is a chance you are actually getting Opus and not an impostor, and you can continue the session instead of restarting it and hoping you get routed correctly. But that doesn't help if the model is swapped for a cheaper one mid-session. I've lost so much work because of these shenanigans.
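The "verification question" workaround described here can be sketched as a pre-flight probe. Everything in this sketch is illustrative: the probe, the expected fragment, and the `ask` callable (a stand-in for whatever chat API is in use) are all hypothetical, and a real probe would need to be a much harder problem:

```python
# Illustrative probe question; weak models are *assumed* to flub harder versions.
PROBE = "List the first ten prime numbers, comma-separated."
EXPECTED_FRAGMENT = "29"  # the tenth prime should appear in any correct answer

def session_passes_probe(ask) -> bool:
    """ask: callable(prompt: str) -> str wrapping the chat API in use.

    Returns True if the reply contains the expected fragment, i.e. the
    session plausibly reached the claimed model. Only a heuristic, and it
    cannot detect a model being swapped for a cheaper one mid-session.
    """
    return EXPECTED_FRAGMENT in ask(PROBE)
```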

chhxdjsjtoday at 8:37 PM

[dead]

hibouailetoday at 8:22 PM

[dead]

vdalhambratoday at 5:58 PM

[dead]

techpulselabtoday at 4:05 PM

[dead]

alphainfotoday at 5:55 PM

[dead]

ryguztoday at 4:50 PM

[dead]

tripleeetoday at 6:43 PM

[dead]

huflungdungtoday at 5:12 PM

[dead]

neversupervised | today at 3:04 PM

Terminal Bench is the future
