People love harping on this one, but model collapse hasn't turned out to be an issue in practice.
There have been symptoms of it, such as the colloquially named "piss filter" and the anime mole-nose problem, but so far they've been symptoms rather than a fatal expression of a disease. That they are symptoms at all, however, shows the condition could be terminal if exploited properly and profusely. So far we haven't seen anyone capable of the "profusely" part.
It doesn't seem like anything has changed to preclude it as a possible outcome yet.
I don't really understand why model collapse would happen.
I understand that if I have an AI model and feed it its own responses, it will degrade in performance. But that's not what's happening in the wild: there are extra filtering steps in between. Users upvote and downvote posts, people post the "best" AI-generated content (the content they prefer), the more human-sounding AI gets more engagement, etc. All of these things filter AI output, so it's not the same thing as:
AI out -> AI in
It is:
AI out -> human filter -> AI in
And at that point the human filter starts acting like a fitness function for a genetic algorithm. Can anyone explain how this still leads to model collapse? Does the signal in the synthetic data just overpower the human filter?
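One proposed answer: the filter doesn't have to be overpowered, it just has to be lossy in a consistent direction. A fitness function selects for what humans upvote, not for preserving the full distribution, so rare-but-valid outputs get filtered away and the tails collapse generation by generation. Here's a toy sketch of that loop (entirely my own illustration; the Gaussian "model", the keep-the-typical-half filter, and all the numbers are made-up assumptions):

    import random
    import statistics

    random.seed(0)
    # Original human-written data: standard normal.
    data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

    for generation in range(10):
        # "Model": fit a Gaussian to the current training set.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        print(f"gen {generation}: mu={mu:+.3f} sigma={sigma:.3f}")
        # AI out: sample synthetic content from the fitted model.
        samples = [random.gauss(mu, sigma) for _ in range(10_000)]
        # Human filter: upvotes favor familiar, typical content,
        # so keep the half of the samples closest to the mean.
        samples.sort(key=lambda x: abs(x - mu))
        # AI in: the survivors become the next training set.
        data = samples[: len(samples) // 2]

Run it and sigma shrinks every generation even with a filter in the loop; the selection pressure narrows the distribution instead of stabilizing it. The open question is whether real human preferences look more like "keep the typical half" or like a filter that actually rewards the tails.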
“It’s been a whole year or two and nothing bad has happened, checkmate doomers!”
It’s pretty shocking how much of today’s web content and forum posts is partially or completely LLM-generated. I’m pretty sure feeding this stuff back into models is widely understood not to be a good thing.
The past is not a good predictor of future performance.
It feels like if it does happen, it will take a lot longer to show up. Also, I doubt they would ship a model that turns out corrupted stuff like that.
It won't mean we see models collapse in public; more that we struggle to get to the next quality increase.