Hacker News

exasperaited, today at 2:40 PM

> I might be crazy, and I'd love to hear from somebody who knows about this, but I've been assuming that AI companies have been pulling GPL code out of the training material specifically to avoid this.

Haha no.

https://windsurf.com/blog/copilot-trains-on-gpl-codeium-does...

And just in the last two days, there was a case of AI generating LGPL headers (which it could not do if identifying LGPL code was pulled from the codebase) and misattributing authors:

https://devclass.com/2025/11/27/ocaml-maintainers-reject-mas...


Replies

pessimizer, today at 5:35 PM

Thanks for the links.

That first link shows people actively pulling out GPL code in 2023 and marketing around that fact, though. That's not great evidence that they're not doing it now, especially if testing whether GPL code is still in there is as easy as prompting with an incomplete piece of it.

I'd think that companies could amass a collection of all known GPL code and test for it regularly in order to refine their methods for keeping it out.
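A rough sketch of what that kind of regular test could look like, assuming you already have a collection of known GPL snippets and some way to ask the model for a completion. Everything here is hypothetical: `complete` stands in for whatever model call you actually have, and the 0.8 threshold is an arbitrary starting point, not a calibrated value.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


def memorization_score(
    snippet: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, completion out
    prefix_fraction: float = 0.5,
) -> float:
    """Prompt with the start of a snippet and compare the completion
    against the real remainder. Returns a similarity in [0.0, 1.0]."""
    cut = int(len(snippet) * prefix_fraction)
    prefix, expected = snippet[:cut], snippet[cut:]
    if not expected:
        return 0.0  # nothing left to compare against
    # Only compare as much output as the true remainder is long.
    produced = complete(prefix)[: len(expected)]
    return SequenceMatcher(None, expected, produced).ratio()


def flag_memorized(
    snippets: Iterable[str],
    complete: Callable[[str], str],
    threshold: float = 0.8,  # assumption: tune against known-clean code
) -> List[str]:
    """Return the snippets the model reproduces closely enough to flag."""
    return [s for s in snippets if memorization_score(s, complete) >= threshold]
```

A high score only says the model can reproduce that text; it doesn't tell you whether it learned it from the GPL original or from a copy elsewhere, so this is a screening step at best.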

> (which it could not do if identifying LGPL code was pulled from the codebase)

Are you sure about this? Linking to LGPL code is fine afaik. And why not train on code that links to universally available libraries that are legal to use? Seems like one might even prefer it.

Seems like this was rejected for size and slop reasons, not licensing. If the submitter of the PR isn't even fixing possibly hallucinated authors' names, it's obvious that they didn't really read it. Debugging vibe-coded stuff is like finding an indeterminate number of needles in a haystack.