I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.
Are you guys affiliated with https://poolside.fm/ or https://poolsuite.net?
This is cool!
I used to work on post-training & evals. It's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!
It was an interesting read - perhaps I misunderstood the part about blocking GitHub, but isn't it possible to just block the model from accessing that specific repo?
I was under the impression that SWE-bench (and, I'd guess, most other benchmarks) was supposed to be run offline?
I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.
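For what it's worth, one way to enforce this at the harness level is to block outbound sockets before the run starts. A minimal sketch, assuming a Python harness running in-process; the function names here are hypothetical, not anything from poolside's setup:

    import socket

    class NetworkBlockedError(RuntimeError):
        """Raised when the process attempts an outbound connection."""

    _real_connect = socket.socket.connect

    def _blocked_connect(self, address):
        # Fail loudly instead of silently reaching the network.
        raise NetworkBlockedError(f"outbound connection attempted: {address}")

    def disable_network():
        # Monkeypatch socket.connect so any in-process TCP connection
        # (HTTP clients, API calls, etc.) raises immediately.
        socket.socket.connect = _blocked_connect

    def enable_network():
        # Restore the original connect for teardown.
        socket.socket.connect = _real_connect

Call disable_network() before the eval loop. Caveat: this only covers the Python process itself - anything the agent shells out to (git, curl) would need OS- or container-level isolation, e.g. running the whole harness with networking disabled.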