This is already well known: all these AI benchmarks use a different model to judge whether or not the solution was correct.
It’s… remarkably poor and, as demonstrated in the paper, easily gamed. Worse yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.