SWE-bench Verified is nice but we need better SWE benchmarks. Building a fair benchmark is a lot of work, and running it continuously takes a lot of money.
Most "live" benchmarks are not run often enough against recent models to give you a good picture of which models win.
The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
Help us out with Terminal Bench 3.0!
https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...