I love how detailed and transparent the dataset statistics are on the Hugging Face pages. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B...
I've noticed that open models have made huge efficiency gains in the past several months. Some of that can be explained by architectural improvements, but it seems quite obvious that a huge portion of the gains comes from the heavy use of synthetic training data.
In this case roughly 33% of the training tokens are synthetically generated by a mix of other open-weight models. I wonder if this trend is sustainable or if it might lead to model collapse, as some have predicted. I suspect that the proliferation of synthetic data throughout open-weight models has led to a lot of the ChatGPT writing style replication (many bullet points, em dashes, "it's not X but actually Y", etc.).