Isn't the Sora video model basically a ViT over spatiotemporal inputs (so they've found a way to compress those inputs down)? And yet LeCun wouldn't consider that a world model?