robbies • yesterday at 6:39 PM

What do you like to use instead? I’ve used the aider leaderboard a couple of times, but it didn’t really stick with me.

Replies

NitpickLawyer • yesterday at 7:20 PM

swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and have a slider that shows the scores for all issues in a given interval. So if you wait ~2 months, you can see how the models perform on real-world issues that are new to them.

It's still not as accurate as benchmarks on your own workflows, but it's better than the original benchmark, or any other public benchmark.
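
The date-window idea in the comment above can be sketched roughly as follows. This is only an illustration of scoring a model on issues created inside a chosen interval, not the benchmark's actual code; the `IssueResult` record, the `resolved_rate` helper, the release date, and all data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IssueResult:
    # Hypothetical record: one benchmark issue and whether the model's patch resolved it.
    created: date
    resolved: bool


def resolved_rate(results, start, end):
    """Fraction of issues created in [start, end] that were resolved, or None if the window is empty."""
    window = [r for r in results if start <= r.created <= end]
    if not window:
        return None
    return sum(r.resolved for r in window) / len(window)


# Made-up data for illustration only, not real benchmark results.
results = [
    IssueResult(date(2025, 1, 10), True),
    IssueResult(date(2025, 2, 5), True),
    IssueResult(date(2025, 4, 2), False),
    IssueResult(date(2025, 4, 20), True),
]
model_release = date(2025, 3, 1)  # hypothetical release date

# Issues that existed before release may have leaked into training data.
before = resolved_rate(results, date(2025, 1, 1), model_release)
# Issues created after release are new to the model.
after = resolved_rate(results, model_release, date(2025, 5, 1))
print(before, after)
```

Comparing the two windows is the point of the slider: a large drop on post-release issues would suggest the earlier score was inflated by issues the model had already seen.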
