That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious if there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.
I don’t think their words mean much of anything; only the behavior of the models does.
Still waiting on Full Self-Driving myself.
They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out-of-distribution performance is still awful.