mohsen1 • today at 9:28 AM • 1 reply

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and running it continuously takes a lot of money.

Most "live" benchmarks are not run often enough against recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.

Replies

cbracketdash • today at 9:55 AM

Help us out with Terminal Bench 3.0!

https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...
