cjsaltlake • yesterday at 6:39 PM • 2 replies

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

Replies

kator • yesterday at 9:08 PM

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or is just really good?

kmdupree • yesterday at 10:29 PM

I disagree: https://www.philosophicalhacker.com/post/anthropic-error/
