Hacker News

kshri24 yesterday at 6:11 AM

I don't think you can classify "public data in" as public domain. Public data can also be covered by commercial licenses that forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is open-source licensed.

That's the core issue here. All models are trained on ALL source code that is publicly available, irrespective of how it was licensed. It is illegal, but every company training LLMs is doing it anyway.


Replies

fschuett yesterday at 11:42 AM

> It is illegal

Only (?) in America. In the EU, scraping is legal by default unless explicitly opted out with machine-readable instructions like robots.txt. That covers "training input". For training output, the rule is: "if the output is unrecognizable compared to the input, the license of the input does not matter" (otherwise, any project X could sue project Y for copyright infringement even if the projects only barely resemble each other). The cases where companies actually got sued were ones where the output was a direct copy or repetition of the input, even if an LLM was involved.

There is, however, a larger philosophical divide between the US and the EU based on history and religion. The US philosophy is highly individualistic, capitalistic, and considers "first-order principles." Copyright is a "property right": "I own this string of bits, you used them, therefore you owe me" (principle of absolute ownership).

Continental philosophy is more social and considers "second-order / causal effects." Copyright is a "personality right" that exists within a social ecosystem. The focus is on the effect of the action rather than a singular principle like "intellectual property." If the new code provides a secondary benefit to society and doesn't "hurt" the original creator's unique intellectual stamp, the law is inclined to view it as a new work.

In terms of legal sociology, America and Britain are more "individual-property-atomistic" thanks to their Protestant heritage, focusing on the rights of the individual (sola me, and my property, and God). Meanwhile, Europe was, at least in large part, Catholic (esp. France), a tradition that focuses more on works, results, and effects on society to determine morality. While the states are officially secular, this heritage echoes in different definitions of what is considered "legal" or "moral", depending on which side of the ocean you are on.

thedevilslawyer yesterday at 6:17 AM

Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is fair game. LLM ingestion comes under fair use, so no worries. If someone can get their hands on it, nothing in law stops it from being ingested for training.

We can debate whether this law is moral. Like the GP, I too agree that "public data in -> public domain out" is what's right for society. Copyright as an artificial concept has gone on for long enough.
