World models are not a new idea. They come from the split between "model-free" and "model-based" reinforcement learning, paradigms that have been around for decades.
Model-free methods (what we do now, without world models) do not actually model what _happens_ when you take an action; they simply model how good or bad an action is. This is highly data inefficient, but asymptotically it often performs better than model-based RL because there is no learned model to introduce bias.
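To make "only models how good or bad an action is" concrete, here is a minimal sketch of a tabular Q-learning update, one common model-free method. The function name `q_update`, the state and action labels, and the values of `alpha` and `gamma` are all illustrative assumptions, not something from a specific system.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step (hypothetical helper).

    Only the value estimate Q(s, a) is adjusted. Nothing here predicts
    which state follows an action, so there is no world model.
    """
    # Best value currently estimated for the state we landed in.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: observed reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# Toy usage with made-up states and actions.
actions = ["left", "right"]
q = defaultdict(float)  # every Q(s, a) starts at 0.0
first = q_update(q, "s0", "right", 1.0, "s1", actions)
second = q_update(q, "s0", "right", 1.0, "s1", actions)
```

Each update needs a real experienced transition, and the estimate only creeps toward the target, which is one way to see where the data inefficiency comes from.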
Model-based RL, where world models come in, models the transition function T(s, a, s'): I'm in state s and I take action a, so what is my belief about my new state s'? With that you can do long-term planning, so it's useful not just for robotics and video generation but for reasoning and planning more broadly. It's also highly data efficient, and right now, for robotics, that is absolutely the name of the game.
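As a rough sketch of the idea, assuming a tiny discrete world: estimate T(s, a, s') from logged transitions by counting, then plan against that estimate with value iteration. The helper names (`estimate_transitions`, `plan`), the three-state toy environment, and the reward function are all made up for illustration. Real world models are learned neural networks over high-dimensional observations, not lookup tables.

```python
from collections import defaultdict

def estimate_transitions(experience):
    """Estimate T(s, a, s') as empirical frequencies from (s, a, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in experience:
        counts[(s, a)][s_next] += 1
    model = {}
    for key, nexts in counts.items():
        total = sum(nexts.values())
        model[key] = {s_next: n / total for s_next, n in nexts.items()}
    return model

def backup(model, reward, value, s, a, gamma):
    """Expected return of taking a in s, using the learned model only."""
    return sum(p * (reward(s, a, s2) + gamma * value[s2])
               for s2, p in model.get((s, a), {}).items())

def plan(model, reward, states, actions, gamma=0.9, sweeps=100):
    """Value iteration on the learned model; returns a greedy policy."""
    value = {s: 0.0 for s in states}
    for _ in range(sweeps):
        value = {s: max(backup(model, reward, value, s, a, gamma) for a in actions)
                 for s in states}
    return {s: max(actions, key=lambda a: backup(model, reward, value, s, a, gamma))
            for s in states}

# Hypothetical logged experience from a three-state toy world.
experience = (
    [("start", "go", "hall")] * 3 + [("start", "go", "start")]
    + [("hall", "go", "goal")] * 4
    + [("start", "stay", "start")] * 2 + [("hall", "stay", "hall")] * 2
    + [("goal", "go", "goal"), ("goal", "stay", "goal")]
)

def reward(s, a, s_next):
    # Assumed reward: 1.0 for arriving at the goal from elsewhere.
    return 1.0 if s_next == "goal" and s != "goal" else 0.0

model = estimate_transitions(experience)
policy = plan(model, reward, ["start", "hall", "goal"], ["go", "stay"])
```

Once the model exists, the planning step needs no further interaction with the environment. The same handful of logged transitions gets reused across every sweep, which is the data-efficiency argument in miniature.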
What you will see is: approximately zero robots, then approximately one crappy robot, once performance and reliability just barely cross the threshold where you can market it (even at a loss!) and people will buy it and put it in their homes. Once that happens you get the magic: a data flywheel for robotics, and things start _rapidly_ improving.
Robotics is where it is because it lacks the volume of data we have on the internet. For robotics today, it's not only egocentric video (for example) that we need but also _sensor-specific_ and _robot-specific_ data (e.g. robot A has a different build and components than robot B).