Hacker News

Through the looking glass of benchmark hacking

31 points by jxmorris12 yesterday at 9:24 PM | 11 comments

Comments

mgrund today at 6:55 PM

I was under the impression that SWE-bench (and I guess most other benchmarks) were supposed to be run offline?

I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.

fsh today at 3:51 PM

I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.

pratio today at 4:15 PM

Are you guys affiliated with https://poolside.fm/ or https://poolsuite.net?

ej88 today at 4:54 PM

This is cool!

I used to work on post-training & evals. It's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!

schnitzelstoat today at 3:14 PM

It was an interesting read. Perhaps I misunderstood the part about blocking GitHub, but is it not possible to just block it from accessing that specific repo?
