You haven't addressed the parent's concern at all, which is about what the LLM was trained on, not what was fed into its context window. The Linux driver is almost certainly in the LLM's training data.
Also, the "spec" that the LLM wrote to simulate the "clean-room" technique is full of C code from the Linux driver.
Okay, so will companies now vibe-code a Linux-like license-washed kernel, to get rid of the GPL?
> The Linux driver is almost certainly in the LLM's training data.
Yes, and? Isn't one of Stallman's four freedoms the "freedom to study the source code" (FSF Freedom 1)? Where does it say I have to be a human to study it? If you argue "oh but you may only read / train on the source code if you are intending to write / generate GPL code", then you're admitting that the GPL effectively is only meant for "libre" programmers in their "libre" universe and it might as well be closed-source. If a human may study the code to extract the logic (the "idea") without infringing on the expression, why is it called "laundering" if a machine does it?
Let's say I look (as a human) at some GPL source code. And then I close the browser tab and roughly re-implement from memory what I saw. Am I now required to release my own code as GPL? More extreme: If I read some GPL code and a year later I implement a program that roughly resembles what I saw back then, then I can, in your universe, be sued because only "libre programmers" may read "libre source code".
In German copyright law, there is a concept of a "fading formula": if the creative features of the original work "fade away" behind the independent content of the new work to the point of being unrecognizable, it constitutes a new work, not a derivative, so the input license doesn't matter. So, for LLMs, even if the input is GPL, proprietary, whatever: if the output is unrecognizable from the input, it does not matter.
fair point, I glossed over that distinction. context separation != training data separation. if the driver was in training data, the "spec from observation" pass is already contaminated before the coding pass begins. the phoenix bios parallel actually required strict information separation at every stage -- here that's not achievable since you can't retrain the model. so the legal protection is much weaker than I implied.
This is speculation, but I suspect the training data argument is going to be a real loser in the courtroom. We’re getting out of the region where memorization is a big failure mode for frontier models. They are also increasingly trained on synthetic text, whose copyright is very difficult to determine.
So far, we have yet to see anyone successfully sue over software copyright with LLMs. This is a bit redundant, but we’ve also not seen a user of one of these models be sued for output.
Maybe we converge on the view of the US Copyright Office, which is that none of this can be protected.
I kind of like that one as a future for software engineers, because it forces them all at long last to become rules lawyers. If we disallow all copyright protection for machine-generated code, there might be a cottage industry of folks who provide a reliably human layer that is copyrightable. Like Boeing, they will have to write to the regulator and not to the spec. I feel that’s a suitable destination for a discipline that’s had it too good for too long.