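It is just much more efficient to train on synthetic data. When you train on real data, the only signal you get at each position is the one token that actually came next. With synthetic data, provided you keep the generating model's output probabilities, you know the whole probability distribution over the next token. Each example therefore carries more information, which gives a multiplier effect, and sometimes that effect is dramatic.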
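To make the difference concrete, here is a minimal sketch in plain Python. It assumes the synthetic data comes with the generating ("teacher") model's next-token probabilities; the four-word vocabulary, the probabilities, and the function names are all made up for illustration. It compares the gradient a student model receives from a one-hot target (real data) with the gradient from a full distribution (synthetic data).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    # Cross-entropy between a target distribution and softmax(logits).
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def grad_wrt_logits(target, logits):
    # For softmax + cross-entropy, d(loss)/d(logits) = softmax(logits) - target.
    probs = softmax(logits)
    return [p - t for p, t in zip(probs, target)]

# Hypothetical toy vocabulary and an untrained student (uniform predictions).
vocab = ["cat", "dog", "car", "the"]
student_logits = [0.0, 0.0, 0.0, 0.0]

# Real data: we only observe that "cat" came next.
hard_target = [1.0, 0.0, 0.0, 0.0]

# Synthetic data (assumed): the teacher's full next-token distribution.
teacher_probs = [0.6, 0.3, 0.05, 0.05]

hard_grad = grad_wrt_logits(hard_target, student_logits)
soft_grad = grad_wrt_logits(teacher_probs, student_logits)

for token, h, s in zip(vocab, hard_grad, soft_grad):
    print(f"{token:>4}  hard={h:+.2f}  soft={s:+.2f}")
```

With the one-hot target, every token other than "cat" gets the same push down, so the student learns nothing about how plausible "dog" is compared with "car". With the teacher's distribution, the same single example also tells the student that "dog" is a reasonable continuation and "car" is not.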