On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers in to the dataset. Only the private problems actually matter.
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers in to the dataset. Only the private problems actually matter.