I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.
Are you guys affiliated with https://poolside.fm/ or https://poolsuite.net?
This is cool!
I used to work on post-training & evals. It's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!
It was an interesting read - perhaps I misunderstood the part about blocking GitHub, but isn't it possible to just block the model from accessing that specific repo?
I was under the impression that SWE-bench (and, I'd guess, most other benchmarks) was supposed to be run offline?
I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.
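For what it's worth, one way to enforce this at the harness level is to block outbound sockets before the run starts. A minimal sketch, assuming a Python harness running in-process; the function names here are hypothetical, not anything from poolside's setup:

    import socket

    class NetworkBlockedError(RuntimeError):
        """Raised when the process attempts an outbound connection."""

    _real_connect = socket.socket.connect

    def _blocked_connect(self, address):
        # Fail loudly instead of silently reaching the network.
        raise NetworkBlockedError(f"outbound connection attempted: {address}")

    def disable_network():
        # Monkeypatch socket.connect so any in-process TCP connection
        # (HTTP clients, API calls, etc.) raises immediately.
        socket.socket.connect = _blocked_connect

    def enable_network():
        # Restore the original connect for teardown.
        socket.socket.connect = _real_connect

Call disable_network() before the eval loop. Caveat: this only covers the Python process itself - anything the agent shells out to (git, curl) would need OS- or container-level isolation, e.g. running the whole harness with networking disabled.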