Would be interesting to see alternative scoring besides “tests pass”, e.g. diff size, abstraction depth added/removed, or whether the solution introduces new modules/dependencies. That would let us see whether “unmergeable” PRs correlate with simple structural signals.
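To make the idea concrete, here is a toy sketch (not from any existing benchmark) of extracting a few of those structural signals from a unified diff. The heuristics are deliberately crude, e.g. counting Python-style `import`/`from` lines as a proxy for new dependencies, and `--- /dev/null` headers as new files:

```python
def diff_signals(diff_text: str) -> dict:
    """Count coarse structural signals from a unified-diff string.

    Illustrative only: new-dependency detection assumes Python-style
    imports, and new-file detection relies on the '--- /dev/null' header.
    """
    added = removed = new_files = new_imports = 0
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            # '--- /dev/null' as the old side means the file is brand new
            if line.startswith("--- /dev/null"):
                new_files += 1
            continue
        if line.startswith("+++ "):
            continue  # new-side file header, not a content line
        if line.startswith("+"):
            added += 1
            if line[1:].lstrip().startswith(("import ", "from ")):
                new_imports += 1
        elif line.startswith("-"):
            removed += 1
    return {
        "lines_added": added,
        "lines_removed": removed,
        "diff_size": added + removed,
        "new_files": new_files,
        "new_import_lines": new_imports,
    }


sample_diff = """\
--- /dev/null
+++ b/util.py
@@ -0,0 +1,3 @@
+import json
+def load(p):
+    return json.load(open(p))
--- a/app.py
+++ b/app.py
@@ -1,2 +1,2 @@
-x = 1
+x = 2
"""

signals = diff_signals(sample_diff)
```

With signals like these per PR, you could bucket benchmark solutions by diff size or dependency churn and check whether the “passes tests but unmergeable” cases cluster anywhere.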
makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]
they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)
and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness
I've been working on building out "evals for your repo" based on the theory that commonly used benchmarks like SWE-bench are broken: they aren't testing the right/valuable things, and they're baked into the training data (see OpenAI's research on this here https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)
Interestingly, I had a similar finding: on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but on other metrics, such as code quality or equivalence to the original PR the task was based on, they showed massive differences. Posted results here if anyone is curious: https://www.stet.sh/leaderboard
This aligns with something I've been noticing in practice: passing tests and being mergeable are fundamentally different quality bars. Tests verify behavior, but code review evaluates maintainability, readability, and whether the solution fits the broader architecture.
The SWE-bench metric essentially measures "can the AI produce a patch that makes tests pass" — which is closer to "junior developer who got the ticket done" than "experienced engineer who shipped clean code." The gap between those two is exactly where most code review friction lives.
What concerns me more is that as teams start using these benchmarks to evaluate AI coding tools, they might optimize for the wrong thing. A tool that produces mergeable PRs 40% of the time is arguably more valuable than one that passes tests 80% of the time but generates code that requires significant rework. We need benchmarks that capture the full review cycle, not just the CI pipeline.
There needs to be a measure (or measures) of the entropy of a codebase that provide a signal of complexity. When you're paying for every token, you want code patterns that convey a lot of immediate information to the agent, so that it can either repeat the pattern or extend it in a way that makes sense. This is probably the next wave of assisted coding (imo), because we're at the stage where writing code works and the quality is mostly decent, but it can be needlessly complex given the context of the existing repo.
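One crude way to approximate that idea (my own toy heuristic, not an established metric) is Shannon entropy over identifier frequencies: a repo that reuses a small vocabulary of names consistently scores lower than one full of one-off names, which loosely tracks how much a pattern-following agent has to learn before extending it:

```python
import math
import re
from collections import Counter


def token_entropy(source: str) -> float:
    """Shannon entropy in bits/token over identifier frequencies.

    Toy heuristic: a codebase that reuses few names consistently
    stays low; lots of one-off names push the entropy up.
    """
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "x = f(x)\n" * 10          # two names, heavily reused
varied = "a = b\nc = d\ne = f\n"        # six names, each used once
```

Here `token_entropy(repetitive)` comes out well below `token_entropy(varied)`. A real signal would need to weigh structure (call graphs, module boundaries), not just name frequencies, but even this crude number gives something to correlate against agent token spend.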
This paper doesn’t really tell us much. The cutoff was September of 2025. The models have improved so much that I just don’t know what you can take away from this experiment.
I was totally aligned until I saw the refusal for a comment in the code. When the refusals are pedantic like that, it just weakens the overall findings significantly.
The test is supposed to be a proxy.
This seems like an important caveat to the SWE-bench, but the trend is still clearly AI becoming more and more capable.
I think a far greater problem is the human psychological and prejudice factor itself. When we hear there was AI assistance on a PR, we usually jump straight to thinking "oh my god, is it another LLM slop" (for example: https://github.com/jneem/imbl/pull/149#pullrequestreview-370...). I do use AI, but I review the code before I push it; most people don't. Once there is a trend, it is easy to form a prejudice and hard to go back, unless there is a substantial improvement in both quality and quantity.
Also, some people will say outright that they reject any AI code, but most maintainers employ silent-treatment tactics. And then when you ask them to review, they either close the PR or say "I'm too busy". I would call this one of the biggest dick moves, because it hurts the most, yet you can't find anything wrong with them until they reveal their motives.
Really interesting note. That echoes thoughts I’ve had about how much automated benchmark scores really reflect production‑ready code.
For me the big takeaway is that passing doesn't automatically mean it is maintainable, follows established patterns/conventions, or avoids unexpected side effects that real reviewers care about.
Do these benchmarks make any sense? I tried a few local models that seem to score well on SWE-bench, but the results were pure rubbish. (For instance MiniMax-M2.5 at 128GB from Unsloth - completely unusable.)
This makes sense to me based on personal experience. LLMs will do anything to pass tests and get a working result, and they will do very weird things to get there. For fun I've tried getting one to do stuff while being purposely ambiguous about the implementation details, and sometimes what it does makes me literally laugh out loud. It can write some very strange code.
But hey, the tests pass!
If I force it to use plan mode for everything and babysit it, it can work really well, but it's really just acting as a faster typer for me, which is great. But it requires an experienced dev steering it.
Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries.
It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow were just... weird. It was hard to follow and I was seriously bugged by it.
So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand.
So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.
So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.