cjsaltlake yesterday at 6:39 PM

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.


Replies

kator yesterday at 9:08 PM

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or is just really good?