credit_guy • today at 12:23 PM

Training on synthetic data is simply much more efficient. When you train on real data, the only thing you know at each position is the one token that actually came next. With synthetic data produced by a model, you know the full probability distribution over the next token, so every position carries more training signal. This results in a multiplier effect, and sometimes the effect is dramatic.
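To make the difference concrete, here is a minimal sketch in plain Python. It assumes the synthetic data comes with the generating model's next-token probabilities (the usual distillation setup, which is my reading of the claim, not something stated above). The 4-token vocabulary, the numbers, and the function names are all made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, next_token):
    """Real-data case: cross-entropy against the single observed token."""
    probs = softmax(student_logits)
    return -math.log(probs[next_token])

def soft_label_loss(student_logits, teacher_probs):
    """Synthetic-data case: cross-entropy against a full distribution."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(teacher_probs, probs) if t > 0)

def hard_label_grad(student_logits, next_token):
    """Gradient of hard_label_loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(student_logits)
    return [p - (1.0 if i == next_token else 0.0) for i, p in enumerate(probs)]

def soft_label_grad(student_logits, teacher_probs):
    """Gradient of soft_label_loss w.r.t. the logits: softmax minus target."""
    probs = softmax(student_logits)
    return [p - t for p, t in zip(probs, teacher_probs)]

# Hypothetical example: untrained student (uniform), toy teacher distribution.
student_logits = [0.0, 0.0, 0.0, 0.0]
teacher_probs = [0.5, 0.3, 0.15, 0.05]  # assumed values, for illustration only
observed_token = 0                      # the one token a real corpus would show

print("hard-label gradient:", hard_label_grad(student_logits, observed_token))
print("soft-label gradient:", soft_label_grad(student_logits, teacher_probs))
```

The hard-label gradient pushes token 0 up and treats the other three tokens identically. The soft-label gradient also says that token 1 should rise and that token 3 should fall further than token 2, so a single position constrains the whole row of the output distribution instead of one entry.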

[1] https://arxiv.org/pdf/2504.14772v1
