I always wonder what will happen when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and the forums are gone, and there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?
Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM could come up with new projects, feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintenance, and it could run end-to-end project simulations to make sure the result actually works.
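Roughly the loop I have in mind, as a toy sketch. Every name here (generate_task, run_e2e_tests, maintainability_score) is a made-up placeholder for a model call, a sandboxed test runner, and a static-analysis heuristic; none of it is a real API:

```python
import random

def generate_task() -> str:
    # Stand-in for the model proposing a new project / feature / chore.
    return random.choice(["add a CLI flag", "fix a flaky test", "refactor a module"])

def generate_solution(task: str) -> str:
    # Stand-in for the model writing the actual patch.
    return f"# patch implementing: {task}"

def run_e2e_tests(task: str, solution: str) -> bool:
    # Stand-in for the end-to-end project simulation.
    return random.random() > 0.3

def maintainability_score(solution: str) -> float:
    # Stand-in for the maintainability part of the value function.
    return random.random()

dataset: list[tuple[str, str]] = []
for _ in range(100):
    task = generate_task()
    solution = generate_solution(task)
    # Only examples that pass the simulation AND score well on
    # maintainability become training data.
    if run_e2e_tests(task, solution) and maintainability_score(solution) > 0.5:
        dataset.append((task, solution))

print(f"kept {len(dataset)} verified synthetic examples")
```

The point being that the filter (tests plus value function) is what would keep the self-generated data from just being slop fed back in.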
That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next cycle of training will probably use the old data plus the new AI slop, and the final result will degrade.
Unless the AIs can work out where the mistakes are, including in the code they themselves generate, your conclusion seems logically valid.
I guess there'll be less collaboration and less sharing with the outside world; people will still collaborate and share, but within smaller circles. It'll bring an end to the "sharing is caring" era of the internet, since it no longer benefits anyone but a few big players.
I bet they'll only train on a snapshot of the internet from right now, before LLM output takes over.
Additional non-internet training material will probably be human-created, or at least human-curated.
Does it matter? Hypothetically, if these pre-training datasets disappeared, you could distill from the smartest current model, or have it write textbooks.
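For the distillation part, here's a minimal sketch of what that looks like, assuming you still have the strong model's logits to learn from. The two Linear layers are toy stand-ins for a big teacher and a small student, not real model architectures:

```python
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
teacher = torch.nn.Linear(dim, vocab)   # stand-in for the frozen strong model
student = torch.nn.Linear(dim, vocab)   # stand-in for the model being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature softens the teacher's distribution

for step in range(100):
    x = torch.randn(32, dim)            # stand-in for token representations
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # KL divergence between softened teacher and student distributions,
    # the standard knowledge-distillation loss.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
```

No web crawl involved: the teacher itself is the data source.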
That idea is called model collapse https://en.wikipedia.org/wiki/Model_collapse
Some studies have shown that direct feedback loops do cause collapse, but many researchers argue it's not a risk at real-world data scales.
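You can see the direct-feedback-loop version in a tiny toy: fit a Gaussian to data, sample the next "generation" from the fit, repeat. On most runs the spread collapses toward zero, i.e. the tails of the original distribution vanish first. This is a cartoon of the effect (the tiny sample size exaggerates it), not a claim about real LLM training scales:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=20)    # the original "human" data

for gen in range(201):
    mu, sigma = data.mean(), data.std()           # "train" on current data
    if gen % 40 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")
    data = rng.normal(mu, sigma, size=20)         # next gen trains on outputs
```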
In fact, a lot of the recent advances in the open-weight model space have come from training on synthetic data. At least 33% of the data used to train NVIDIA's recent Nemotron 3 Nano model was synthetic. They use it as a way to get high-quality agentic capabilities without doing tons of manual work.