I'm interested in how much of the "Clean Data" is synthetic data from "unclean" models...
> with AI-generated content excluded from pre-training.
> without distillation from third-party models
Sounds like zero, unless they are lying.
“We trained it from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.”
So, laundered data?