I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particula...

usernametaken29 • today at 7:46 AM • 3 replies • view on HN

I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.

Replies

versteegen • today at 9:55 AM

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.

gpt5 • today at 7:59 AM

ARC-AGI isn't perfect, but it helps demonstrates the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

energy123 • today at 8:15 AM

Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.

➕ show 1 reply

alt Hacker News

Replies