Was Alphago's move 37 original?
In the last step of training LLMs, reinforcement learning from verified rewards, LLMs are trained to maximize the probability of solving problems using their own output, depending on a reward signal akin to winning in Go. It's not just imitating human written text.
Fwiw, I agree that world models and some kind of learning from interacting with physical reality, rather than massive amounts of digitized gym environments is likely necessary for a breakthrough for AGI.