Hacker News

achierius, yesterday at 6:13 PM

I think most people have moved past SWE-Bench Verified as a benchmark worth tracking: it covers only a few repos and a small number of languages, and, probably more importantly, papers have come out showing a significant degree of memorization in current models, e.g. models knowing the path of the file containing the bug when prompted only with the issue description and without access to the actual filesystem. SWE-Bench Pro seems much more promising, though it doesn't avoid all of the problems above.


Replies

robbies, yesterday at 6:39 PM

What do you like to use instead? I’ve used the aider leaderboard a couple of times, but it didn’t really stick with me.
