SchemaLoad today at 3:34 AM

Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the training dataset. Only the private problems actually matter.


Replies

sanxiyn today at 3:37 AM

In this case, the code is public and you can see they are not cheating in that sense.
