> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking
I don't doubt that it's an oversight; it does, however, say something about the researchers that they apparently didn't look at a single output, where they would have caught this immediately.
Given how modest the decrease in the benchmark score was after the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable, and the cheating didn't affect most of the benchmark, so spot-checking a handful of outputs could easily have missed it.
So many data problems would be caught if everyone looked at a few outputs instead of only at the metrics.