Hacker News

christophilus | last Friday at 4:17 PM | 3 replies

I tend to agree, but I wonder… if you train an LLM on only GPL code, and it generates non-deterministic predictions derived from those sources, how do you prove it’s in violation?


Replies

FeepingCreature | last Friday at 5:13 PM

You don't because it isn't, unless it actually copies significant amounts of text.

Algorithms cannot be copyrighted. Text can be copyrighted, but reading publicly available text, learning from it, and then writing your own text is simply not the sort of transformation that copyright reserves to the author.

Now, sometimes LLMs do quote GPL sources verbatim (if they're trained badly). You can prove that with a simple text comparison, the same as for any other copyright violation.
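A minimal sketch of what such a text comparison could look like, using Python's standard-library `difflib` to find the longest verbatim run shared between a model's output and a GPL source file (the code snippets here are made up for illustration):

```python
from difflib import SequenceMatcher

def longest_shared_run(generated: str, source: str) -> str:
    """Return the longest contiguous run of text shared by both strings."""
    m = SequenceMatcher(None, generated, source, autojunk=False)
    match = m.find_longest_match(0, len(generated), 0, len(source))
    return generated[match.a:match.a + match.size]

# Hypothetical GPL-licensed snippet and hypothetical LLM output.
gpl_source = "static int parse_header(struct buf *b) { return b->len > 0; }"
llm_output = "int check(struct buf *b) { return b->len > 0; }"

shared = longest_shared_run(llm_output, gpl_source)
print(shared)  # prints the verbatim overlap shared by both snippets
```

A long shared run is the kind of evidence that matters here; a real analysis would scan many files and normalize whitespace and identifiers, but the principle is the same.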

layer8 | last Friday at 5:09 PM

By knowing that its output is derived from GPL sources?