That's something I have been wondering about. If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation. I don't see why this shouldn't apply to LLMs as well. If an LLM might have been trained on the original source code, it should be considered "tainted".
> If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation.
That is the difference between necessary and sufficient. Clean-room is sufficient to guarantee avoiding copyright infringement, but it is not necessary. The actual legal line sits well short of that, but the clean-room position was chosen because they didn't want to risk crossing the line and it was easier to argue for in court.
tl;dr: clean room is overkill for avoiding copyright infringement
Yes, and realistically any code that LLMs produce is a derivative work of their training data. There's going to be a huge disaster licensing-wise.
I have absolutely no idea how LLMs got through anyone's legal department. I guess the hope is that if everyone breaks the law enough, it'll just be fine.