I'm sure that with benchmarks like these, future LLMs will be optimized to hide regressions by "fixing" the test framework too.
Isn't misalignment great.