The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.
What if you give Opus the same harness? Do people even care about meaningful comparisons anymore, or is it all just “numbers go up”?
Would the single sentence “Imagine you are a regular computer player and accustomed to the usual elements of games” count as a harness?
Does it matter, though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also, perhaps this hints at something bigger: we're wasting our time focusing on the model when we could be focusing on the harness.
When you're on the hunt for VC cash, "numbers go up" is the main criterion.