Reinforcement learning from self-play/AlphaWhatever? Nah, must just be datasets. :)
Self-play gives you an explosion of data.
But there's a big difference between a perfect-information, fully specified zero-sum game and the real world.
As a simple analogy, read out the following sentence multiple times, stressing a different word each time.
"I never said she stole my money"
Notice how the meaning changes each time, and how each reading is often distinct?
That is a lens into the frame problem and its inverse, the specification problem.
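A toy sketch, not from the original comment: the helper `stressed_readings` is hypothetical, and it marks stress with asterisks as a stand-in for spoken emphasis. It simply enumerates the readings, one per word, to make the count concrete.

```python
# Hypothetical illustration: one surface string, one reading per stressed word.
# Asterisks are an assumed stand-in for spoken emphasis.
SENTENCE = "I never said she stole my money"


def stressed_readings(sentence: str) -> list[str]:
    """Return one copy of the sentence per word, with that word marked as stressed."""
    words = sentence.split()
    readings = []
    for i in range(len(words)):
        marked = [f"*{w}*" if j == i else w for j, w in enumerate(words)]
        readings.append(" ".join(marked))
    return readings


for reading in stressed_readings(SENTENCE):
    print(reading)
```

Stripping the asterisks from any of the seven readings gives back the identical string, which is the point: the text alone doesn't say which reading was meant.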
The above problem quickly becomes tower-complete, and recent studies suggest that RL reinforces, i.e. increases the weight of, existing patterns.
As the open-domain frame problem and similar challenges are equivalent to HALT (the halting problem), finding new ways to extract useful information will be important for generalization, IMHO.
Synthetic data is useful, but not a complete solution, especially for tower problems.
And architecture work, like long context that is actually useful. Whatever they did with Gemini 2.5 is miles ahead of the previous models at getting useful results out of long context. I'd be very surprised if Gemini 2.5 is "just" Gemini 1 w/ better data.