Knowing the nature of a test ahead of time and building out your capabilities and tooling before entering the exam hall, when your peers don't have that advantage, makes you a cheater.
https://en.wikipedia.org/wiki/Goodhart's_law
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard.
According to the authors, though, the harness isn't ARC-AGI-specific: https://x.com/agenticasdk/status/2037335806264971461
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
We constantly underestimate the power of inference scaffolding. I have seen it in every domain: coding, ASR, ARC-AGI benchmarks, you name it. Scaffolding can do a lot, and so can post-training. I am confident our current pre-trained models could score over 80% on this benchmark with the right post-training and scaffolding. That being said, I don't think ARC-AGI proves much. It is not a useful task in the wild; it is just a game, and a strange, confusing one at that. For me it is a pointless pseudo-academic exercise: good to have, but it by no means measures intelligence, much less the utility of a model.
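For what it's worth, by "scaffolding" I just mean wrapping the model call in a feedback loop with the environment. A minimal sketch of the idea below; everything in it (`GameEnv`, `call_model`, the number-guessing game) is a hypothetical stand-in, not the actual Agentica harness or the ARC-AGI-3 API:

```python
# Minimal sketch of inference scaffolding: propose an action, run it
# against the environment, feed the result back, retry until solved.
# `GameEnv` and `call_model` are toy stand-ins, NOT any real harness.

from dataclasses import dataclass, field

@dataclass
class GameEnv:
    """Toy environment: guess a hidden number between 0 and 100."""
    target: int
    history: list = field(default_factory=list)

    def step(self, action: int) -> tuple[str, bool]:
        self.history.append(action)
        if action == self.target:
            return "solved", True
        return ("too low", False) if action < self.target else ("too high", False)

def call_model(observation: str, feedback: list) -> int:
    """Placeholder for an LLM call; here, a trivial bisection policy."""
    lo = max((a for a, msg in feedback if msg == "too low"), default=0)
    hi = min((a for a, msg in feedback if msg == "too high"), default=100)
    return (lo + hi) // 2

def scaffolded_solve(env: GameEnv, max_turns: int = 10) -> bool:
    """The scaffold itself: loop model calls, accumulate structured feedback."""
    feedback: list = []
    for _ in range(max_turns):
        action = call_model("guess the number", feedback)
        msg, done = env.step(action)
        if done:
            return True
        feedback.append((action, msg))  # the accumulated feedback IS the scaffold
    return False

print(scaffolded_solve(GameEnv(target=37)))  # True, solved in 3 turns
```

The point is that the loop and the feedback format, not the model, do much of the work, which is exactly why harness choice matters so much for headline numbers.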
Anybody used this Agentica of theirs?
The fact that this was run on the set of training problems, with a custom harness, basically makes the headline a lie.
What if you gave Opus the same harness? Do people even care about meaningful comparisons anymore, or is it all just "numbers go up"?