modeless • today at 3:33 AM

On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".

Replies

SchemaLoad • today at 3:34 AM

Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the dataset. Only the private problems actually matter.
