Hacker News

christophilus | last Friday at 4:17 PM | 3 replies

I tend to agree, but I wonder… if you train an LLM on only GPL code, and it generates non-deterministic predictions derived from those sources, how do you prove it’s in violation?


Replies

FeepingCreature | last Friday at 5:13 PM

You don't because it isn't, unless it actually copies significant amounts of text.

Algorithms cannot be copyrighted. Text can be copyrighted, but reading publicly available text, learning from it, and then writing your own text is simply not the sort of transformation that copyright reserves to the author.

Now, sometimes LLMs do quote GPL sources verbatim (if they're trained badly). You can prove that with a simple text comparison, the same as for any other copyright violation.
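A minimal sketch of what such a text comparison could look like, using Python's standard-library `difflib` to find the longest verbatim run shared between a model's output and a GPL source file (the code snippets here are made up for illustration):

```python
from difflib import SequenceMatcher

def longest_shared_run(generated: str, source: str) -> str:
    """Return the longest contiguous run of text shared by both strings."""
    m = SequenceMatcher(None, generated, source, autojunk=False)
    match = m.find_longest_match(0, len(generated), 0, len(source))
    return generated[match.a:match.a + match.size]

# Hypothetical GPL-licensed snippet and hypothetical LLM output.
gpl_source = "static int parse_header(struct buf *b) { return b->len > 0; }"
llm_output = "int check(struct buf *b) { return b->len > 0; }"

shared = longest_shared_run(llm_output, gpl_source)
print(shared)  # prints the verbatim overlap shared by both snippets
```

A long shared run is the kind of evidence that matters here; a real analysis would scan many files and normalize whitespace and identifiers, but the principle is the same.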

layer8 | last Friday at 5:09 PM

By knowing that its output is derived from GPL sources?