Knowing the nature of a test ahead of time and building out your capabilities and tooling before entering the exam hall, when your peers don't have that advantage, makes you a cheater.
https://en.wikipedia.org/wiki/Goodhart's_law
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard.
According to the authors, though, the harness isn't ARC-AGI-specific: https://x.com/agenticasdk/status/2037335806264971461
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
We constantly underestimate the power of inference scaffolding. I have seen it in every domain: coding, ASR, ARC-AGI benchmarks, you name it. Scaffolding can do a lot, and so can post-training. I am confident our current pre-trained models could score over 80% on this benchmark with the right post-training and scaffolding. That being said, I don't think ARC-AGI proves much. It is not a useful task in the wild; it is just a game, and a strange, confusing one at that. For me it is a pointless pseudo-academic exercise: good to have, but it by no means measures intelligence, much less the utility of a model.
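For what it's worth, by "scaffolding" I just mean wrapping the model call in a feedback loop with the environment. A minimal sketch of the idea below; everything in it (`GameEnv`, `call_model`, the number-guessing game) is a hypothetical stand-in, not the actual Agentica harness or the ARC-AGI-3 API:

```python
# Minimal sketch of inference scaffolding: propose an action, run it
# against the environment, feed the result back, retry until solved.
# `GameEnv` and `call_model` are toy stand-ins, NOT any real harness.

from dataclasses import dataclass, field

@dataclass
class GameEnv:
    """Toy environment: guess a hidden number between 0 and 100."""
    target: int
    history: list = field(default_factory=list)

    def step(self, action: int) -> tuple[str, bool]:
        self.history.append(action)
        if action == self.target:
            return "solved", True
        return ("too low", False) if action < self.target else ("too high", False)

def call_model(observation: str, feedback: list) -> int:
    """Placeholder for an LLM call; here, a trivial bisection policy."""
    lo = max((a for a, msg in feedback if msg == "too low"), default=0)
    hi = min((a for a, msg in feedback if msg == "too high"), default=100)
    return (lo + hi) // 2

def scaffolded_solve(env: GameEnv, max_turns: int = 10) -> bool:
    """The scaffold itself: loop model calls, accumulate structured feedback."""
    feedback: list = []
    for _ in range(max_turns):
        action = call_model("guess the number", feedback)
        msg, done = env.step(action)
        if done:
            return True
        feedback.append((action, msg))  # the accumulated feedback IS the scaffold
    return False

print(scaffolded_solve(GameEnv(target=37)))  # True, solved in 3 turns
```

The point is that the loop and the feedback format, not the model, do much of the work, which is exactly why harness choice matters so much for headline numbers.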
Anybody used this Agentica of theirs?
The fact that this was run on the set of training problems, with a custom harness, basically makes the headline a lie.
What if you gave Opus the same harness? Do people even care about meaningful comparisons anymore, or is it all just "numbers go up"?