willtemperley • today at 6:28 AM

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

Replies

stingraycharles • today at 12:48 PM

Ehr, the SWE-Bench examples are particularly horrible, as they are just publicly available historical PRs. So if the models are trained on GitHub data, those PRs will be included.

So almost by design that particular benchmark is tainted, and it measures recall rather than reasoning.
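A rough sketch of that point, assuming each benchmark task can be tied to the date its source PR became public: any task whose PR predates a model's training cutoff could have been in a GitHub scrape. The cutoff date, task ids, and the `possibly_seen` helper are all made up for illustration; none of them come from SWE-Bench or from any model's documentation.

```python
from datetime import date

# Hypothetical training cutoff; no real cutoff is stated in this thread.
TRAINING_CUTOFF = date(2024, 1, 1)

# Hypothetical tasks: (task id, date the source PR was merged publicly).
tasks = [
    ("example-repo-101", date(2021, 5, 3)),
    ("example-repo-202", date(2023, 11, 20)),
    ("example-repo-303", date(2024, 6, 14)),
]

def possibly_seen(merged_on, cutoff=TRAINING_CUTOFF):
    """A public PR merged before the cutoff could have been in the training data."""
    return merged_on < cutoff

# Split tasks into those that may be contaminated and those that cannot be.
exposed = [task_id for task_id, merged_on in tasks if possibly_seen(merged_on)]
clean = [task_id for task_id, merged_on in tasks if not possibly_seen(merged_on)]

print("possibly in training data:", exposed)
print("published after cutoff:", clean)
```

A date check like this only shows that exposure was possible, not that it happened, and it says nothing about later copies or discussions of the same PR.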
