NitpickLawyer • yesterday at 7:20 PM

swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and there's a slider that shows the scores for all issues in a given interval. So if you wait ~2 months after a model's release, you can see how it performs on real-world issues that are new to it.
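To make the slider idea concrete, here's a rough sketch of scoring only the issues created inside a date window. This is not their actual code: `IssueResult`, `resolve_rate`, the release date, and all the sample data are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IssueResult:
    created: date   # when the issue was opened (hypothetical field)
    resolved: bool  # whether the model's fix passed the tests (hypothetical field)


def resolve_rate(results, start, end):
    """Fraction of issues created in [start, end] that were resolved.

    Returns None when no issues fall inside the window.
    """
    window = [r for r in results if start <= r.created <= end]
    if not window:
        return None
    return sum(r.resolved for r in window) / len(window)


# Illustrative data only, not real benchmark results.
results = [
    IssueResult(date(2025, 1, 10), True),
    IssueResult(date(2025, 2, 5), True),
    IssueResult(date(2025, 4, 2), False),
    IssueResult(date(2025, 4, 20), True),
]

model_release = date(2025, 3, 1)  # assumed release date for the example

# Score over everything vs. only issues opened after the release.
overall = resolve_rate(results, date(2025, 1, 1), date(2025, 5, 1))
fresh = resolve_rate(results, model_release, date(2025, 5, 1))
```

The interesting number is `fresh`: issues opened after the release can't have been in the training data, so a gap between `overall` and `fresh` hints at contamination in the older issues.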

It's still not as accurate as benchmarking on your own workflows, but it's better than the original benchmark, or any other public benchmark.
