This seems closely related to the problem of model collapse [1][2][3], where LLMs lose the tails of the distribution. When you recursively train on the output of an LLM, or otherwise feed the output back into the input in subsequent stages, you lose the precision and diversity that human authors bring to the work. Eventually everything regresses to the mean, and anything that would've made the content unique, useful, and differentiated gets lost.
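As a toy illustration of the mechanism (my own sketch, not from the cited papers): stand in for the model with a one-dimensional Gaussian and repeatedly refit it on a finite sample of its own output, and the fitted standard deviation drifts toward zero, i.e. the tails vanish and everything collapses toward the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0   # the original "human-authored" distribution
    n = 20                 # finite sample drawn at each generation

    for gen in range(1, 201):
        synthetic = rng.normal(mu, sigma, n)           # sample from the current model
        mu, sigma = synthetic.mean(), synthetic.std()  # refit on its own output
        if gen % 50 == 0:
            print(f"generation {gen:3d}: mean={mu:+.4f}  std={sigma:.4f}")

Run it for enough generations and the std shrinks to effectively zero: the variance estimate from a finite sample shrinks in expectation at each step, and once the tails are gone no later generation can recover them.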
My takeaway from this is that AI is a temporary phenomenon, the end stage of the Internet age. It's going to destroy the Internet as we know it, along with much of the technological knowledge of the developed world, and then we're going to have to start fresh and rebuild everything we know. So I'm trying to use AI to identify and download the remaining sources of facts on the Internet: the human-authored stuff that isn't generated for engagement, but comes from the era when people were just putting useful things online to share information.
[1] https://en.wikipedia.org/wiki/Model_collapse
[2] https://www.nature.com/articles/s41586-024-07566-y
[3] https://cacm.acm.org/blogcacm/model-collapse-is-already-happ...
There are plenty of AI systems that are immune to this because they're trained on data that won't be flooded with slop, e.g. robotics and self-driving cars (both trained on real camera/sensor inputs), or programming/proof-assistant stuff (trained on things that are verifiable).