> the overwhelming majority of input it has in fact seen somewhere in the corpus it was trained on.
But it thinks just great on stuff it wasn't trained on.
I give it code I wrote that isn't in its training data, built on new concepts I came up with for an academic paper I'm writing, and ask it to extend the code in accordance with those concepts, and it does a great job.
This isn't regurgitation. Even if a lot of LLM usage is, the whole point is that it also does fantastically with stuff that is brand new: it genuinely creates new, valuable output it has never seen before, assembling it in ways that require thinking.
I think you may be giving academic papers too much credit, or rather, even a novel paper is often 99% existing material with only about 1% that's genuinely new.
I think it would be hard to prove that it's truly so novel that nothing similar is present in the training data. In research, it's quite easy to miss related work even with extensive searching.