> Second, clean data. MAI-Thinking-1 was trained on clean and appropriately licensed data, with AI-generated content excluded from pre-training. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.
Shots fired?
It would be interesting to see how far "clean data" can go on the scaling laws.
It's interesting because their last model series (Phi) was based around the thesis that high-quality synthetic data is better than a large pre-training corpus.
I doubt any lab would say otherwise; they all _claim_ to use licensed data.
I'd assume it's not up to par with Qwen-3.5 then, which has been distilling Claude; Qwen's quality is probably a direct result of that.
Interesting. Wasn't their previous attempt (Phi) trained mostly on synthetic data?
All the labs "clean" their pretraining data, and you can keep your pretraining data minimally AI-generated while still spamming synthetic post-training data.
I'm interested in how much of the "clean data" is synthetic data from "unclean" models...
I would really like to see what "appropriately licensed data" means. I cannot imagine they didn't copy all the open repos on GitHub, and I can't imagine they asked for permission or are now reproducing the license texts from those repos. It sounds hand-wavy.
P.S. A fairly basic website otherwise, but it unfortunately seems to hijack scrolling for no good reason.