
Groxx · yesterday at 10:04 PM

There is the nuance that much code that is available publicly (which includes a GIGANTIC amount of that "written by people in their spare time" stuff) is put there with the explicit goal of showing other people all the details so they can read, reuse, and modify it. Open-source licenses in some form are incredibly popular, though the details vary, and seeing your side project in a product that 100k people use is usually just neat, not "you stole from me".

Artworks have their relatively popular Creative Commons stuff, and some of those follow a similar "do whatever" vibe, but I far more frequently see "attribution required", which generally requires it at the point of use, i.e. immediately alongside the piece. And if someone saw your work once and made something different separately, the license generally does not apply. LLMs have no way to do that kind of attribution, though, and they hammer out stuff that looks eerily familiar but isn't pixel-precise to the original, so it feels like, and probably is, an unapproved use of the artist's work.

The code equivalent of this is usually "if you have source releases, include it there", or a very few have the equivalent of "please shove a mention somewhere deep in a settings screen that nobody will tap on". Using that code for training is, I think, relatively justifiable. The licenses matter (and have clearly been broadly ignored, which should not be allowed), but if something isn't prohibited it's generally allowed, and if you didn't want that you would need to choose a more restrictive license or not publish the code at all.

Plus, like, artists generally are their style, in practical terms, so copying their style is effectively impersonation. Coders, on the other hand, often intentionally lean heavily on style-erasing tools like auto-formatters and common design patterns and whatnot, so their code blends cleanly into more places rather than sounding exclusively like "them".

---

I'm generally okay with permissively licensed open-source code being ingested and spat back out in a useful product. That's kinda the point of those licenses. If it requires attribution, it gets murky and probably leans towards "no" - it's clearly not a black-box re-implementation; the LLMs are observing the internals and sometimes regurgitate them verbatim, which is generally not acceptable when humans do it.

Do I think the biggest LLM companies are staying within the more-obviously-acceptable licenses? Hell no. So I avoid them.

Do I trust any LLM business to actually stick to those licenses? ... probably not right now. But one could exist. Hopefully it'd still have enough training data to be useful.