Hacker News

mohsen1, today at 9:28 AM

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark takes a lot of work, and running it continuously takes a lot of money.

Most "live" benchmarks are not run often enough against recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.


Replies

cbracketdash, today at 9:55 AM

Help us out with Terminal Bench 3.0!

https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...