TL;DR is that they didn't clean the repo (.git/ folder), model just reward hacked its way to look up future commits with fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589
(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking
I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.
Never escaping the hype vendor allegations at SWEbench are they.
As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images