Hacker News

kqr today at 3:58 PM

It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests.

https://entropicthoughts.com/no-swe-bench-improvement


Replies

civvv today at 6:19 PM

This is likely true. I think model quality has stagnated and that it's likely a non-trivial task to find a new improvement vector. Scaling the width of the model (which has been the driving force behind the speed of improvement thus far) seems to have reached its limit.

It will be interesting to see the implications of this. Tooling can only do so much in the long term.

cjsaltlake today at 6:45 PM

But that's an enormous source of coding productivity, and it's why Anthropic is worth billions... The reason SWE-bench has been so successful and useful for coding is that software engineering has a ton of tradition and infrastructure for making and using automated tests.