Hacker News

The current state of the theory that GPL propagates to AI models

144 points by jonymo | today at 12:48 PM | 181 comments

Comments

Orygin | today at 2:52 PM

Great article but I don't really agree with their take on GPL regarding this paragraph:

> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that things are moving in a different direction from the code sharing idealized by the GPL. If the theory of GPL propagation to models advances on its own, then in practice only data exclusion and closing-off to avoid litigation risk will progress, and there is a fear that this will not lead to the expansion of free software culture.

The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct meant to ensure the software is not stolen from its users. If you just want your code to be shared and used without restrictions, use MIT or some other license.

> What is important is how to realize the “freedom of software,” which is the philosophy of open source

"Freedom of software" means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article.)

I also don't agree with the argument that because a lot of things are included in the model, the GPL code is only a small part of the whole, and that therefore makes it okay. Well, if I take one GPL function and include it in my project, no matter the project's size, I would have to license the whole thing as GPL. Where is the line? Why would my software, which contains only a single GPL function, not be fair use?

palata | today at 2:47 PM

Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open-sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that they will just feed the LLMs that people use? And so on.

zamadatix | today at 1:50 PM

The article goes deep into the two cases deemed most relevant, but really there is a wide swath of similar cases, all focused on drawing sharper borders than ever around what is essentially the question "exactly when does it become copyright violation?", with plenty of seemingly "obvious" answers that quickly conflict with each other.

I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I'd also not be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of each domain.

Not a lawyer, just excited to see the outcomes :).

myrmidon | today at 2:21 PM

I honestly think that the most extreme take, that "any output of an LLM falls under the copyright of all of its training data," is not really defensible, especially when contrasted with human learning, and I would be curious to hear conflicting opinions.

My view is that copyright in general is a pretty abstract and artificial concept; thus the corresponding regulation needs to justify itself by being useful, i.e. by encouraging and rewarding content creation.

/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old, established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up at a small handful of the most successful producers-- this effectively externalizes large portions of the cost of "raising" artists.

I view the AI overlap through the same lens-- if current copyright rules lead to undesirable outcomes (by making all AI training or use illegal/infeasible), then the law or its interpretation simply has to be changed.

phplovesong | today at 2:16 PM

We need a new license that forbids all training. That is the only way to stop big corporations from doing this.

graemep | today at 1:45 PM

The article repeatedly treats license and contract as though they are the same, even though the sidebar links to a post that discusses the difference.

A lot of it boils down to whether training an LLM is a breach of copyright of the training materials, which is not specific to the GPL or open source.

ljlolel | today at 2:09 PM

And then also to all code made from the GPL’d AI model?

pessimizer | today at 2:22 PM

I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

Corporations have always talked about the virality of the GPL, sometimes (but not always) to the point of exaggeration. You'd think that after getting the proof of concept done, the AI companies would be running away at full speed from setting a bomb like that in their goldmine.

Putting in tons of commonly read books and scientific papers is safer; they can just eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and it has been openly and specifically hostile from the beginning. With MIT, Apache, etc., you can just include a fistful of licenses with the download, or even come up with architectures that track names to add for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all of its output (unless they restricted it to LGPL).

Wouldn't you just pull it out?

simgt | today at 1:52 PM

What triggers me is how insistent Claude Code is on adding "co-authored by Claude" to commits, in spite of my settings and an instruction in CLAUDE.md. I wish all these tech bros were as willing to credit the human shoulders on which their products are built. But they'd be much less successful in our current system if they were that kind of people.
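
(For reference, the knob I mean lives in Claude Code's settings.json; if I remember the key name right it's includeCoAuthoredBy, but treat the exact key and file path below as a best guess rather than gospel:)

    // ~/.claude/settings.json (or .claude/settings.json in the repo) -- key name and path from memory, may differ
    {
      "includeCoAuthoredBy": false
    }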

dmezzetti | today at 2:11 PM

As someone who has spent a fair amount of time developing open source software, I will say I genuinely dislike copyleft and GPL.

For those who are into freedom, I don't see how dictating in such a manner how you can use what you build is in the spirit of free and open.

Just my opinion on it, to each their own on the matter.

rvnx | today at 1:47 PM

GPL and copyright in general don't apply to billionaires, so pretty much a non-topic.

It's just a side cost of doing business, because asking for forgiveness is cheaper and faster than asking for permission.

pclmulqdq | today at 1:48 PM

I thought the whole concept of a viral license was legally questionable to begin with. There haven't been cases about this, as far as I know, and GPL virality enforcement has just been done by the community.

uyzstvqs | today at 3:36 PM

Training is not redistribution. It's exactly the same as a person learning to program from proprietary, secret code and then writing their own original code independently. Even if you repeat patterns and methods you've picked up from that proprietary learning material, it is by no means redistribution. The practical differentiator here is that you do not access the proprietary material during the creation of your own original work, similar in principle to a clean-room design. With AI/ML, what matters is that the training data is not accessed during inference, which it is not.

The other relevant factor of copyright is how the material is obtained. If the material is publicly accessible without protection, you have no reasonable expectation of exclusive control over its use. If you don't want AI training done on your work, you need to put access to it behind explicit authentication, with a legally binding user agreement prohibiting that use case. Do note that this would cost your project its status as open source.
