Hacker News

modeless today at 3:33 AM

On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".


Replies

SchemaLoad today at 3:34 AM

Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the dataset. Only the private problems actually matter.
