energy123 • today at 5:38 PM • 0 replies • view on HN

Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), and yet LeCun wouldn't consider that a world model?
