keeda • yesterday at 7:55 PM • 7 replies

> Second, clean data. MAI-Thinking-1 was trained on clean and appropriately licensed data, with AI-generated content excluded from pre-training. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.

Shots fired?

It would be interesting to see how far "clean data" can go on the scaling laws.

Replies

foresterre • yesterday at 9:20 PM

I would really like to see what "appropriately licensed data" means. I can't imagine they didn't copy all the open repos on GitHub, and I can't imagine they asked for permission or are now reproducing the license texts from those repos. It sounds hand-wavy.

P.S. A fairly basic website otherwise, but it unfortunately seems to be hacking scroll for no good reason.

supermdguy • yesterday at 9:33 PM

It's interesting because their last model series (Phi) was based around the thesis that high-quality synthetic data is better than a large pre-training corpus.

vdfs • yesterday at 8:21 PM

I doubt any lab would say otherwise; they all _claim_ to use licensed data.

swalsh • yesterday at 9:45 PM

I'd assume it's not up to par with Qwen-3.5 then, which has been distilling Claude, and the quality of the model is probably a direct result of that.

andai • yesterday at 10:02 PM

Interesting. Wasn't their previous attempt (Phi) trained mostly on synthetic data?

vanuatu • yesterday at 10:05 PM

All the labs "clean" their pretraining data, and you can keep your pretraining data minimally AI-generated while still spamming synthetic data in post-training.

onlyrealcuzzo • yesterday at 7:59 PM

I'm interested in how much "clean data" is synthetic data from "unclean" models...
