Exactly. In principle, at least, the only way to overfit to ARC-AGI is to actually be that smart.
Edit: if you disagree, try actually TAKING the ARC-AGI-2 test, then post.
With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.
Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.
Let's assume "all metrics are perfect" for now. Then, if you ranked people by chess performance, you wouldn't see the people with the highest intelligence at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.
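To make the "tails come apart" point concrete, here's a toy simulation (all weights and numbers are illustrative assumptions, not real psychometric data): model chess performance as a mix of general intelligence and an independent chess-specific skill, then look at the IQ of the top performers.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Illustrative model: performance = 0.6*IQ + 0.8*specific skill, so
    # performance correlates with IQ at ~0.6 but is not IQ itself.
    iq = rng.standard_normal(n)        # general intelligence, z-scored
    skill = rng.standard_normal(n)     # chess-specific skill, independent
    chess = 0.6 * iq + 0.8 * skill

    top = np.argsort(chess)[-1000:]    # top 0.1% of chess performers
    print(f"mean IQ of top chess performers: {iq[top].mean():+.2f} sd")
    print(f"mean IQ of top 0.1% by IQ:       {np.sort(iq)[-1000:].mean():+.2f} sd")

You get roughly +2.0 sd versus +3.4 sd: the best chess performers are well above average in intelligence, but nowhere near the actual IQ tail. That's all "the tails come apart" means here.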
Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to a progressive-matrices test? perhaps even usable for measuring human IQ?), but no metric is perfect, and ARC-AGI is heavily biased towards spatial reasoning specifically.
Is it different every time? Otherwise a model could just memorize the answers during training.
It's very much a vision test. The only reason models don't pass it easily is the vision component; it doesn't have much to do with reasoning at all.
Completely false. This is like saying being good at chess is equivalent to being smart.
To see how impactful targeted practice can be, look no further than the hodgepodge of independent teams that keep up with SotA while running cheaper models (and, no doubt, thousands of their own puzzles, many of which surely overlap with the private set).
The benchmark isn't particularly robust to gaming, even with a private test set.