Hacker News

montroser yesterday at 5:32 PM

> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.


Replies

achierius yesterday at 6:13 PM

I think most people have moved past SWE-Bench Verified as a benchmark worth tracking: it covers only a few repos and a small number of languages and, probably more importantly, papers have come out showing a significant degree of memorization in current models, e.g. models knowing the path of the file containing the bug when prompted only with the issue description and without access to the actual filesystem. SWE-Bench Pro seems much more promising, though it doesn't avoid all of the problems above.
