Benchmarks built on public tests are too easy to game: the model owners can just fold the answers into the training data. Only the private problems actually matter.
In this case, the code is public, so you can see they are not cheating in that sense.