SpicyLemonZest • yesterday at 8:36 PM

Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, and advocated for a new version, SWE-bench Pro, that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination in the future, for reasons that, to be honest, I don't really understand.)
