Hacker News

threepts · today at 3:31 PM · 5 replies

Why don't they ask their premier model to generate a bench for them?

Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning-heavy.

Leaderboard: https://arcprize.org/leaderboard

(Most premier models don't even pass 5 percent.)


Replies

falcor84 · today at 3:36 PM

They focus on minimizing the number of moves and don't allow any harness whatsoever, which sets the bar extremely high. The current top verified contender (Claude Opus 4.6) is at only 0.45%. But given how new the benchmark is, I expect a lot of improvement in the next generation of models.

sowbug · today at 4:38 PM

> Why don't they ask their premier model to generate a bench for them?

It's not a crazy idea. Have the older model interview the newer one, then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of rounds in which all sides agree the newer model won is the score.
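The scoring loop sketched above is simple to write down. A minimal sketch, with hypothetical `make_judge` stand-ins in place of real model calls (each "judge" returns the name of the model it thinks won the interview):

```python
import random

def make_judge(bias):
    """Hypothetical stand-in for a model acting as a judge. A real judge
    would read the interview transcript; this stub just votes 'new' with
    probability `bias`, deterministically per seed."""
    def judge(seed):
        rng = random.Random(seed)
        return "new" if rng.random() < bias else "old"
    return judge

def agreement_score(old_judge, new_judge, referee, rounds=100):
    """Run `rounds` interviews with different seeds; the score is the
    fraction of rounds where every judge agrees the newer model won."""
    wins = 0
    for seed in range(rounds):
        votes = {old_judge(seed), new_judge(seed), referee(seed)}
        if votes == {"new"}:  # unanimous win for the newer model
            wins += 1
    return wins / rounds

score = agreement_score(make_judge(0.9), make_judge(0.9), make_judge(0.9))
```

Fixing the seed per round keeps each comparison reproducible, so two runs of the benchmark score the same model pair identically.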

alansaber · today at 3:33 PM

Very reasoning-heavy benchmarks do seem like the way to go; they're the hardest to game.

xtracto · today at 4:43 PM

Can AI write a problem so difficult that even AI cannot solve it?

Hehe
