Hacker News

hydrogen7800 · yesterday at 2:21 PM · 4 replies

Right, that success story holds only because there was "organic" (for lack of a better term) information from an original source. What happens when all information is nth-generation AI feedback, with all links to the original source lost?

Edit: A question from AI/LLM ignorance: can the source database for an LLM be one-way, in that it does not contain output from itself or other LLMs? I can imagine a quarantined database used for specific applications that remains curated, but this seems impossible on the open internet.


Replies

bigthymer · yesterday at 3:04 PM

> Can the source database for an LLM be one-way, in that it does not contain output from itself, or other LLMs?

I think, for public internet data, we can only be reasonably confident about information published before the public release of ChatGPT.
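That cutoff idea could be sketched as a simple date filter. Everything here is hypothetical (the record shape, the field names, and treating ChatGPT's release date of 2022-11-30 as the boundary); a real pipeline would also need crawl dates and provenance checks, since publication metadata can be missing or forged.

```python
from datetime import date

# Hypothetical "low-background" cutoff: ChatGPT's public release.
CUTOFF = date(2022, 11, 30)

def is_pre_llm(doc: dict) -> bool:
    """Return True if the document record claims a pre-cutoff publish date.

    `doc` is a hypothetical record with a `published` date field.
    Records with no date are rejected rather than trusted.
    """
    published = doc.get("published")
    return published is not None and published < CUTOFF

docs = [
    {"url": "a", "published": date(2019, 5, 1)},
    {"url": "b", "published": date(2023, 2, 14)},
    {"url": "c", "published": None},
]
clean = [d for d in docs if is_pre_llm(d)]
# Only record "a" survives the filter.
```

Note the conservative default: an unknown date is treated as post-cutoff, which loses some genuine pre-LLM text but keeps AI-generated text out.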

nsvd2 · yesterday at 6:20 PM

Yes, people have likened pre-LLM Internet content to low-background steel.

If, in some hypothetical future, the continual-learning problem gets solved, an AI could learn directly from the real world instead of from publications, and retain that data.

nprateem · yesterday at 8:54 PM

That's one reason Google developed an algorithm (SynthID) to watermark AI output.

black_puppydog · yesterday at 2:46 PM

That's exactly why text written before the first LLMs carries a premium these days. So no: all major models suffer from slop in their training data.