Hacker News

brandonb — yesterday at 10:37 PM

Very cool. I learned something new about why EMA (exponential moving average) is needed:

> EMA-based training dynamics like JEPA’s don’t optimize any smooth mathematical function, yet they provably converge to useful, non-collapsed representations.

All the papers say EMA avoids “representation collapse” without justifying it. Didn’t realize there were any theoretical results here.


Replies

soraki_soladead — today at 12:15 AM

Roughly: when you train a model to align its predictions with its own predictions, the simplest "correct" solution is to output a single value regardless of the input, aka representation collapse. A constant output guarantees that the predicted representations agree, which is technically what the objective asks for, but it's degenerate.
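A minimal sketch of why the constant solution is so tempting to the optimizer (the encoder and loss here are illustrative, not from any particular JEPA paper): an encoder that ignores its input zeroes out a self-alignment loss perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

def constant_encoder(x):
    # Degenerate "solution": ignore the input entirely and
    # always emit the same embedding vector.
    return np.ones(8)

# Two different inputs/views that a self-predictive objective
# would ask the model to embed consistently.
x_a = rng.normal(size=32)
x_b = rng.normal(size=32)

# Self-alignment loss: mean squared distance between the embeddings.
loss = np.mean((constant_encoder(x_a) - constant_encoder(x_b)) ** 2)
print(loss)  # 0.0 -- the objective is "solved" while encoding nothing
```

The loss is exactly zero for any pair of inputs, so without some asymmetry (stop-gradients, an EMA target, etc.) gradient descent has no reason to prefer an informative encoder.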

EMA helps because the target network changes more slowly than the learning network, which prevents rapid collapse: the predictions are forced to align with what a historical average of the model would have predicted. This is a harder and more informative task, because the model can't trivially output one value and have it match the EMA target, so it learns more useful representations.
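The update itself is just an exponential moving average over parameters. A minimal sketch, assuming parameters are numpy arrays and using a typical decay value (the function name and the 0.996 default are illustrative):

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.996):
    # The target network drifts slowly toward the online network:
    #   theta_target <- decay * theta_target + (1 - decay) * theta_online
    # With decay close to 1, the target is effectively a long-horizon
    # average of past online weights, so it can't collapse as fast as
    # the online network can move.
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# Toy parameters: the online net has moved away from the target.
online = [np.zeros(4)]
target = [np.ones(4)]

target = ema_update(target, online, decay=0.9)
print(target[0])  # each entry: 0.9 * 1.0 + 0.1 * 0.0 = 0.9
```

Called once per training step (after the online network's gradient update), this is the same slow-target pattern used by BYOL-style methods and by target networks in DQN.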

EMA has a long history in deep learning (many GANs use it, TD-learning methods like DQN, many JEPA papers, etc.), so authors often omit justifying it, out of over-familiarity or sometimes cargo-culting. :)