Hacker News

hmmmmmmmmmmmmmm · yesterday at 2:22 PM

But it doesn't, except on certain benchmarks that likely involve overfitting. Open source models are nowhere to be seen on ARC-AGI: nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328


Replies

meffmadd · yesterday at 2:26 PM

Have you ever used an open model for a bit? I am not saying they are not benchmaxxing, but they really do work well and are only getting better.

irthomasthomas · yesterday at 6:57 PM

This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements on ARC performance translate to other skills. In fact, there is some evidence that it hurts performance: when OpenAI trained a version of o1 on ARC, it got worse at everything else.

AbstractGeo · yesterday at 6:49 PM

That's a link from July of 2025, so definitely not about the current release.

Zababa · yesterday at 2:48 PM

Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?

doodlesdev · yesterday at 2:59 PM

GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the types of qualities that most people assess when using LLMs.