That original model collapse paper has largely been misunderstood: in practice, collapse only happens if you're not curating the generated data at all. The original paper even specifies (emphasis mine):
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [0]
In practice nobody is "indiscriminately" using model output to fine-tune models, since that wouldn't even make sense. Even if you're harvesting web data generated by LLMs, that data has in fact been curated: its acceptance on whatever platform you found it on is a form of curation.
There was a very recent paper, "Supervised Fine-Tuning on Curated Data is Reinforcement Learning (and can be improved)" [1], whose content is pretty well summarized by its title. So long as the data is curated in some way, the curation signal provides additional information to the model, and the results should improve somewhat.
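To make the equivalence concrete, here's a minimal toy sketch (all names and the accept rule are hypothetical, not from either paper): keeping only the completions that pass a filter before SFT is the same as weighting samples by a binary reward, where kept samples get reward 1 and discarded samples get reward 0 — i.e., a crude form of REINFORCE.

```python
import random

random.seed(42)


def sample_completions(model, prompt, n):
    """Stand-in for drawing n completions from a model."""
    return [model(prompt) for _ in range(n)]


def curate(completions, accept):
    """Keeping only accepted samples == binary reward:
    reward 1 for kept completions, reward 0 for discarded ones.
    SFT on the survivors is then reward-weighted fine-tuning."""
    return [c for c in completions if accept(c)]


# Toy "model" emits integers; toy curation keeps even ones.
toy_model = lambda prompt: random.randint(0, 9)
accept_even = lambda c: c % 2 == 0

batch = sample_completions(toy_model, "some prompt", 20)
curated = curate(batch, accept_even)
print(len(batch), len(curated))
```

The point of the sketch is just that the filter injects information the raw generations didn't carry, which is why curated self-training can improve a model while indiscriminate self-training degrades it.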
0. https://www.nature.com/articles/s41586-024-07566-y
1. https://www.arxiv.org/pdf/2507.12856
edit: updated based on cooksnoot's comment
There are multiple papers on model collapse. Being able to avoid model collapse is different from it "being disproven".
If you just mean its risk has been exaggerated and/or oversimplified, then yeah, you'd have a point.