logoalt Hacker News

SpicyLemonZestyesterday at 8:36 PM0 repliesview on HN

Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, advocating for a new version SWE-bench Pro that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination issues in the future, for reasons that to be honest I don't really understand.)