Hacker News

From 0% to 36% on Day 1 of ARC-AGI-3

82 points by lairv today at 1:32 AM | 43 comments

Comments

stephantul today at 6:44 AM

The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.

What if you give Opus the same harness? Do people even care about meaningful comparisons anymore, or is it all just “numbers go up”?

padolsey today at 7:19 AM

Knowing the nature of a test ahead of time, building out your capabilities and tooling before entering the exam hall when your peers don't have that advantage, makes you a cheater.

gslin today at 5:18 AM

https://en.wikipedia.org/wiki/Goodhart's_law

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

lairv today at 1:32 AM

Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard.

According to the authors, the harness isn't ARC-AGI-specific, though: https://x.com/agenticasdk/status/2037335806264971461

modeless today at 3:33 AM

On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".

bytesandbits today at 6:13 AM

We constantly underestimate the power of inference scaffolding. I have seen it in all domains: coding, ASR, ARC-AGI benchmarks, you name it. Scaffolding can do a lot! And post-training too. I am confident our current pre-trained models can score over 80% on this benchmark with the right post-training and scaffolding. That being said, I don't think ARC-AGI proves much. It is not a useful task at all in the wild. It is just a game; a strange and confusing one. For me this is just a pointless pseudo-academic exercise. Good to have, but it by no means measures intelligence, and even less the utility of a model.

esafak today at 1:49 AM

Anybody used this Agentica of theirs?
