You should check out "model collapse". It seems that an abundance of content, that is more and more AI generated these days, may not be a viable option. There is also a vast amount of data that is increasingly going private or behind paywalls.
>You should check out "model collapse". It seems that an abundance of content, that is more and more AI generated these days, may not be a viable option.
Doom-saying about "model collapse" is kind of funny when OpenAI and Anthropic are mad at Chinese model makers for "distilling" their models, i.e., using those models' outputs to train their own.
People love harping on this one, but model collapse hasn't turned out to be an issue in practice.