I bet from now on they'll only train on the internet snapshot from before LLMs.
Additional non-internet training material will probably be human-created, or at least curated.
Nope. Pretraining runs have been moving forward with internet snapshots that include plenty of LLM content.
This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online that are flat wrong (it's definitely not).