Hacker News

parineum · last Wednesday at 12:17 PM · 4 replies

At this point, they're all using each other, because so much of the new content they're scraping for data is itself AI-generated.

These models will converge and plateau, because the datasets are only going to get worse as more of their training content becomes incestuous.


Replies

jsheard · last Wednesday at 2:35 PM

The default Llama 4 system prompt even instructs it to avoid various ChatGPT-isms, presumably because they've already scraped so much GPT-generated material that it noticeably skews their models' output.

sovietmudkipz · last Wednesday at 1:58 PM

I recall that AI trained on AI output over many cycles eventually degrades into something akin to a noise texture, as the output quality drops rapidly with each generation.
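That degradation can be sketched with a toy model (my own illustration, not from the thread, with arbitrary parameters): fit a Gaussian to some data, sample from the fit, refit to those samples, and repeat. Because each generation re-estimates the distribution from a finite sample, estimation error compounds and the fitted variance collapses toward zero over many generations:

```python
# Toy "model collapse" simulation: each generation trains only on
# samples drawn from the previous generation's fitted distribution.
import random
import statistics

random.seed(0)  # deterministic run

mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
n = 20                # finite sample each generation trains on

history = [sigma]
for generation in range(1000):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)      # re-estimate mean from samples
    sigma = statistics.stdev(samples)   # re-estimate spread from samples
    history.append(sigma)

print(f"sigma after generation 1:    {history[1]:.3f}")
print(f"sigma after generation 1000: {history[-1]:.6f}")
```

The fitted spread shrinks by orders of magnitude: the model ends up confidently generating a narrow sliver of the original distribution, which is the tail-loss behavior the model-collapse papers describe.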

Won’t most AI-produced content put out into the public be human-curated, thus heavily mitigating this degradation effect? If we’re going to see a full-length AI-generated movie, it seems like humans will be heavily involved, hand-holding the output and throwing out the AI’s nonsense.

wkat4242 · last Wednesday at 12:27 PM

Yes indeed, some studies have already been done on this.

zackangelo · last Wednesday at 2:54 PM

There might be a plateau coming but I’m not sure that will be the reason.

It seems counterintuitive, but there is some research suggesting that training on synthetic data might actually be productive.
