Without SWE-Bench, though, how will AI models properly game their results to show a ~5-10% gain each iteration?
Once a benchmark is known and there are billions of dollars on the line, obviously every company will game it.