logoalt Hacker News

LarsDu88today at 4:45 PM1 replyview on HN

I really hate the world model terminology, but the actual low level gripe between LeCunn and autoregressive LLMs as they stand now is the fact that the loss function needs to reconstruct the entirety of the input. Anything less than pixel perfect reconstruction on images is penalized. Token by token reconstruction also is biased towards that same level of granularity.

The density of information in the spatiotemporal world is very very great, and a technique is needed to compress that down effectively. JEPAs are a promising technique towards that direction, but if you're not reconstructing text or images, it's a bit harder for humans to immediately grok whether the model is learning something effectively.

I think that very soon we will see JEPA based language models, but their key domain may very well be in robotics where machines really need to experience and reason about the physical the world differently than a purely text based world.


Replies

energy123today at 5:38 PM

Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), but at the same time LeCunn wouldn't consider that a world model?