
rpdillon · today at 5:32 PM

How did you draw those conclusions? They don't seem to be in line with court rulings (i.e. Anthropic), which hold that training is fair use. Code is being treated the same as any other copyrighted content used for training, from blog posts to PR announcements from companies and everything in between. Of course, the blog posts and PR announcements have their copyright held by their authors, with no license provided at all, so if OSS code being used in training is a violation, then so is everything else being trained on (to a first approximation; public domain works excepted). But no court has ever taken that position to my knowledge.

There's just so much confusion around this. In this thread alone:

* Distillation is legal under copyright; the violations would come as ToS violations, which is contract law, not copyright law.

* Training is legal as well, so long as the original material was obtained legally.

* Moving code off of GitHub doesn't change any of this: AI companies are free to download your git repo no matter where it is hosted, just like they can any other content on a publicly accessible website.

* Liability comes into the picture when the models are used to infringe copyright in their output. We'll have to see the outcome of the NYT case here, but that is proceeding at a glacial pace.

I am not a lawyer; I'm an interested amateur who's been following the saga for years. I wish the discussion here on HN were more nuanced.

If anyone has legal updates that render any of the above incorrect, I'd love a pointer to the decisions. One area where I'm particularly weak is the legal status in countries outside the US: I don't follow those laws nearly as carefully, nor the court cases brought there.


Replies

bayindirh · today at 5:55 PM

I have written about this numerous times, so I won't repeat myself in long form. Maybe I need to keep a list of my comments somewhere so I can reference them. I digress...

In short:

- GPL code requires attribution and sharing of source code. Models strip the license, so the GPL is effectively violated.

- Source-available licenses are "for your eyes only," so training on source-available code also violates said code's licenses.

- MIT requires attribution, but omitting it has no practical consequences, so it's more of a gray area.

About moving from GitHub:

- Some publicly hosted repositories provide visible and invisible anti-scraping protections, so it's not always that easy.

- The GPL says I only need to share code with the people who receive the application itself, so I can move to the cathedral model.

Moreover:

- The US government's stance is "if we need to ask permission for everything, the AI industry will die." Hence, as an outsider, the court rulings carry no weight in my eyes: they are taking a stance to enable, not hinder, the industry. If one reads the fair use doctrine, it's very possible to rule otherwise. OpenAI's whole non-profit research arm was an instrument to lean on fair use: it sidestepped the doctrine's weighing of commercial use of copyrighted works and supported the "we only do research, pinky promise" framing.

When courts said "go ahead, we're not looking," companies started torrenting e-books to train models (ahem, Meta, ahem) or buying, cutting, scanning, and OCRing books to train theirs (Anthropic).

So the situation is left murky to allow Silicon Valley to thrive, not to protect people's blood, sweat, and tears. These works are provided by peasants anyway, so why bother?

Addendum: Courts said model outputs can't be copyrighted. So copyrighted code goes in, and non-copyrightable code comes out. It's effectively license-washing.
