I always assumed that these benchmarks were run in a sandbox. I'm surprised that no one realized this sooner.
Running benchmarks at scale and protecting against reward hacking is non-trivial.
I'm surprised anyone took them seriously in the first place.