That's an interesting hypothesis: that LLMs are fundamentally unable to produce original code.
Do you have papers to back this up? That was also my reaction when I saw some eerily accurate comments in a vibe-coded piece of code, but I couldn't prove it, and thinking about it now I believe my intuition was wrong (i.e., LLMs do produce original, complex code).
The whole "reproduces training data verbatim" argument is a red herring.
It reproduces _patterns from the training data_, sometimes including verbatim phrases.
The work (to discover those patterns, to figure out what works and what does not, to debug some obscure heisenbug and write a blog post about it, ...) was done by humans. Those humans should be compensated for their work, not the owners of mega-corporations who found a loophole in copyright.
Pick up a programming book from the seventies or eighties that was unlikely to have been scanned and fed into an LLM. Take a task from it, one that even a student could solve within 10 minutes, and ask an LLM to write a program for it. If the problem was never really published before, the LLM fails spectacularly.
> that LLMs are fundamentally unable to produce original code.
What about humans? Are humans capable of producing completely original code or ideas or thoughts?
As the saying goes, if you want to create something from scratch, you have to start by inventing the universe.
The human mind works by noticing patterns and applying them in different contexts.
I have a very anecdotal, but interesting, counterexample.
I recently asked Gemini 3 Pro to create an RSS feed reader type of experience by using XSLT to style and lay out an OPML file. I specifically wanted it to use a server-side proxy for CORS, pass caching headers through the proxy to leverage standard HTTP caching, and I needed all feed entries for every feed in the OPML to be combined into a single chronological feed.
It initially told me multiple times that it wasn't possible (it also reminded me that Google is getting rid of XSLT). Regardless, after I reiterated multiple times that it is possible, it finally decided to make a temporary POC. That POC worked on the first try, with only one follow-up to standardize date formatting with support for both Atom and RSS.
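For context, the proxy piece is conceptually along these lines. This is my own illustrative sketch of the architecture I asked for, not the generated code; the endpoint shape (`?url=...`), port, and class name are all made up:

```python
# Sketch of a CORS proxy that passes upstream caching headers through,
# so the XSLT-rendered page can fetch cross-origin feeds and the browser's
# normal HTTP caching still works. Stdlib only; not production code.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.parse import parse_qs, urlparse
from urllib.request import Request, urlopen

# Upstream headers forwarded to the browser so standard caching applies.
PASSTHROUGH = ("ETag", "Last-Modified", "Cache-Control", "Expires", "Content-Type")

class FeedProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical endpoint shape: GET /?url=<feed url>
        target = parse_qs(urlparse(self.path).query).get("url", [None])[0]
        if not target:
            self.send_error(400, "expected ?url=<feed url>")
            return
        req = Request(target, headers={"User-Agent": "feed-proxy-sketch"})
        # Forward the browser's conditional headers so upstream
        # 304 Not Modified responses survive end to end.
        for h in ("If-None-Match", "If-Modified-Since"):
            if self.headers.get(h):
                req.add_header(h, self.headers[h])
        try:
            resp = urlopen(req, timeout=10)
        except HTTPError as e:
            resp = e  # an HTTPError is itself a readable response (e.g. 304)
        except OSError as e:  # DNS failure, refused connection, timeout...
            self.send_error(502, f"upstream fetch failed: {e}")
            return
        self.send_response(resp.getcode())
        for h in PASSTHROUGH:  # pass caching headers through untouched
            if resp.headers.get(h):
                self.send_header(h, resp.headers[h])
        # The CORS part: allow the feed-reader page to read the response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(resp.read())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), FeedProxy).serve_forever()
```

The follow-up work was on the client side: normalizing RSS `pubDate` versus Atom `updated` timestamps so everything could be merged into one chronological list.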
I obviously can't say the code was novel, though I would be a bit surprised if it had trained on that task enough to remember roughly the full implementation and yet still claimed it was impossible.
I think the burden of proof is on the people making the original claim (that LLMs are indeed spitting out original code).
No, the thing needing proof is the novel idea: that LLMs can produce original code.
We can approach that question intuitively: if human input is not what drives the output, then it should be sufficient to train a model on a fraction of the current inputs, say everything up to 1970, and have it generate the post-1970 data as output.
If that does not work, then the moment you introduce AI you cap its capabilities unless humans continue to create original works to feed it. The conclusion - to me, at least - is that these pieces of software regurgitate their inputs and are effectively whitewashing plagiarism, or, alternatively, that their ability to generate new content is capped at some limit relative to the inputs.