Hacker News

keeda yesterday at 7:55 PM

> Second, clean data. MAI-Thinking-1 was trained on clean and appropriately licensed data, with AI-generated content excluded from pre-training. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.

Shots fired?

It would be interesting to see how far "clean data" can go on the scaling laws.


Replies

foresterre yesterday at 9:20 PM

I would really like to see what "appropriately licensed data" means. I can't imagine they didn't copy every open repo on GitHub, and I can't imagine they asked for permission or are now reproducing the license texts from those repos. It sounds hand-wavy.

P.S. A fairly basic website otherwise, but it unfortunately seems to be hacking scroll for no good reason.

supermdguy yesterday at 9:33 PM

It's interesting because their last model series (Phi) was based around the thesis that high-quality synthetic data is better than a large pre-training corpus.

vdfs yesterday at 8:21 PM

I doubt any lab would say otherwise; they all _claim_ to use licensed data.

swalsh yesterday at 9:45 PM

I'd assume it's not up to par with Qwen-3.5 then, which has been distilling Claude, and the quality of the model is probably a direct result of that.

andai yesterday at 10:02 PM

Interesting. Wasn't their previous attempt (Phi) trained mostly on synthetic data?

vanuatu yesterday at 10:05 PM

All the labs "clean" their pretraining data, and you can keep your pretraining data minimally AI-generated while still spamming synthetic data in post-training.

onlyrealcuzzo yesterday at 7:59 PM

I'm interested in how much "Clean Data" is synthetic data from "unclean" models...
