They are called "world models" because what they represent internally is a "world" rather than a video frame or image. The model needs to "understand" geometry and physics to output a video.
Just because there are errors in the output doesn't mean it isn't significant. If a machine learning model understands how physical objects interact with each other, that is very useful.