I suggest reading the Mythos report's discussion of SWE-bench and contamination. I find it fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.
> models that aren't over-optimized for it.
But how do you know whether the model was over-optimized for it or is just really good?
I disagree: https://www.philosophicalhacker.com/post/anthropic-error/