Hacker News

LiamPowell, last Saturday at 6:27 AM

> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have immediately caught this.


Replies

domoritz, last Saturday at 10:56 AM

So many data problems would be solved if everyone looked at a few outputs instead of only metrics.

alyxya, last Sunday at 12:09 AM

Given the decrease in the benchmark score from the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable, and its cheating didn't affect most of the benchmark.