jmyeet • yesterday at 9:22 PM • 2 replies

Yeah, this is something I've been thinking about too. LLMs have basically profited from "stealing" (arguably) user-generated content from a time when there were no LLMs. In the LLM era there won't be a new Stack Overflow to train LLMs on going forward.

We're getting closer to Dead Internet Theory too, where a lot of accounts, particularly on Twitter, are just LLMs. I imagine it's a huge problem on Reddit too: people farming karma, running influence campaigns, or simply grifting for ad revenue.

So we're going to get to a point where the corpus we train LLMs on will itself be filled with LLM slop. Self-reinforcing slop. Is that the future?

Replies

aucisson_masque • yesterday at 10:26 PM

It's been studied: LLMs that feed on their own output regress, and the results become very bad after a few generations.

mattmanser • yesterday at 9:34 PM

It's happening here too. I saw dang hint that they're not even responding to a lot of questions about it anymore because of the scale of the problem.

If you browse with showdead on, you'll see a lot more greyed-out comments that look reasonable.
