Hacker News

robbies yesterday at 6:39 PM

What do you like to use instead? I’ve used the aider leaderboard a couple of times, but it didn’t really stick with me.


Replies

NitpickLawyer yesterday at 7:20 PM

swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months you can see how the models perform on new (to them) real-world issues.

It's still not as accurate as benchmarks on your own workflows, but it's better than the original benchmark, or any other public benchmark.
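
To make the interval idea concrete, here is a rough sketch of scoring a model only on issues opened after its release. This is not how the benchmark itself is implemented; `IssueResult`, `resolve_rate`, the sample data, and the release date are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueResult:
    """One benchmark issue and the outcome for a single model (hypothetical schema)."""
    issue_id: str
    created: date    # when the issue was opened
    resolved: bool   # whether the model's patch passed the issue's tests


def resolve_rate(results, start, end):
    """Fraction of issues created in [start, end] that were resolved.

    Returns None when no issue falls inside the interval.
    """
    window = [r for r in results if start <= r.created <= end]
    if not window:
        return None
    return sum(r.resolved for r in window) / len(window)


# Made-up sample data, purely for illustration.
results = [
    IssueResult("a", date(2025, 1, 10), True),
    IssueResult("b", date(2025, 3, 5), False),
    IssueResult("c", date(2025, 3, 20), True),
    IssueResult("d", date(2025, 4, 2), False),
]

model_release = date(2025, 3, 1)  # hypothetical release date

# Score on everything vs. only on issues the model could not have seen.
overall = resolve_rate(results, date(2025, 1, 1), date(2025, 4, 30))
fresh = resolve_rate(results, model_release, date(2025, 4, 30))
```

Comparing `overall` with `fresh` is the kind of check the slider lets you do: a large drop on the post-release interval hints that the older score may be inflated by issues the model had already seen.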