Hacker News

onlyrealcuzzo yesterday at 7:59 PM

I'm interested in how much of the "Clean Data" is synthetic data from "unclean" models...


Replies

bicx yesterday at 8:53 PM

So, laundered data?

ertgbnm yesterday at 8:19 PM

> with AI-generated content excluded from pre-training.

> without distillation from third-party models

Sounds like zero, unless they are lying.

xavriley yesterday at 8:01 PM

“We trained it from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.”
