I don't know why people still get wrapped around the axle of "training data".
Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. That's the whole point of the ARC-AGI tests.
Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and just as unsurprisingly it did much better at Pokémon.
The benchmarks, by design, indicate that you could mix up the switch-puzzle pattern and the model would still solve it.