The transcript does seem to overlook post-training steps like Reinforcement Learning with Verifiable Rewards (RLVR) (but I certainly won't claim that Rich Sutton is unaware of such things; RLVR has a very narrow set of evaluation approaches).
I wonder if this is a precursor to Keen Tech leaning into David Silver's Ineffable Intelligence approach.
This was exactly what I was thinking of. RLVR is the secret sauce behind o3 and its many successors.
It's the secret sauce behind why the current models are so great at coding and soon to be unbeatable at math.
LLMs can pose many questions and, if the answers are easily verifiable, be fine-tuned very heavily on the ones that check out. A lot of the world-models discussion will inevitably lean into simulations as verification.
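To make the "easily verifiable" part concrete, here is a minimal sketch of what a verifiable reward might look like for a toy arithmetic question. Everything in it is an assumption for illustration: `Problem`, `verifiable_reward`, and `score_samples` are hypothetical names, the candidate answers are hard-coded strings standing in for model samples, and no actual fine-tuning happens. It is not how any particular lab implements RLVR.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical container for a question with a machine-checkable answer."""
    prompt: str
    expected: int  # ground truth that a checker can compare against


def verifiable_reward(problem: Problem, answer: str) -> float:
    """Binary reward: 1.0 if the answer parses to the expected integer, else 0.0."""
    try:
        return 1.0 if int(answer.strip()) == problem.expected else 0.0
    except ValueError:
        # Anything that is not a plain integer is treated as wrong.
        return 0.0


def score_samples(problem: Problem, samples: list[str]) -> list[tuple[str, float]]:
    """Pair each candidate answer with its reward."""
    return [(s, verifiable_reward(problem, s)) for s in samples]


# Stand-ins for answers sampled from a model (assumed, not real model output).
problem = Problem(prompt="What is 17 * 6?", expected=102)
samples = ["102", " 102 ", "112", "one hundred two"]

scored = score_samples(problem, samples)
# Only the verified answers would be used as a positive training signal.
kept = [s for s, reward in scored if reward == 1.0]
```

The point is just that the checker is cheap and unambiguous, which is why coding (run the tests) and math (compare the final answer) fit so well, and why simulations are a natural candidate for the same role elsewhere.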