Yeah, this is something I've been thinking about too. LLMs have basically profited from "stealing" (arguably) user-generated content from a time when there were no LLMs. In the LLM era there won't be a new Stack Overflow to train LLMs on going forward.
We're getting closer to Dead Internet Theory too, where a lot of accounts, particularly on Twitter, are just LLMs. I imagine it's a huge problem on Reddit too: people farming karma, running influence campaigns, or simply grifting for ad revenue.
So we're going to get to a point where the corpus we train LLMs on will itself just be filled with LLM slop. Self-reinforcing slop. Is that the future?
It's happening here too. I saw dang hint that they're no longer even responding to a lot of questions about it because of the sheer volume of the problem.
If you browse with showdead on, you'll see a lot more of what look like reasonable comments greyed out.
It's been studied (it's usually called "model collapse"): LLMs that are trained on their own output regress, and the results get very bad after a few generations.
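For anyone who wants to see the feedback loop for themselves, here's a toy sketch. To be clear, this is not an LLM and not the setup from any particular study: it just fits a Gaussian to a handful of samples, then draws the next generation's "training data" from that fit, over and over. The function name `simulate_collapse` and all the parameters (300 generations, 10 samples, the seed) are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Toy model-collapse loop: refit a Gaussian on its own samples.

    Hypothetical illustration only. Returns the fitted standard
    deviation at each generation, starting with the true value 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "human" data distribution
    history = [sigma]
    for _ in range(generations):
        # "Train" on a small sample produced by the previous generation.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Maximum-likelihood fit: sample mean and population std dev.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"initial spread: {history[0]:.3g}, final spread: {history[-1]:.3g}")
```

With small samples the fitted spread tends to shrink a little every round, so the distribution narrows toward a single point and the tails (the rare, interesting stuff) disappear first. Real models are vastly more complicated, but the intuition is similar.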