Given the decrease in the benchmark score from the correction, I don't think you can assume the...

alyxya • last Sunday at 12:09 AM • 0 replies

Given the decrease in the benchmark score from the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable, and its cheating didn't affect most of the benchmark results.
