I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
The average ARC AGI 2 score for a single human is around 60%.
"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something.