I assume they took the actual repos’ license info into account. I don’t understand why they should ask for permission when the license would already allow for it.
Which licenses allow usage for training? MIT, BSD, etc. likely do. But I would expect it gets weird for all the various copyleft licenses.
Almost all licenses attach requirements to redistributing copies of the work, or derivatives thereof. Even permissive licenses do. That's very little to ask when open source devs have provided thousands of hours of free work.
For example, the Apache 2.0 license requires in just 4.c:
Just because they're tokenized and transformed into a probabilistic mapping doesn't suddenly mean that they weren't copied. I find it unethical that they (likely) just ingest the IP of all open source repos without asking, and, importantly, without any attribution.
Let me also note that I'm not against LLMs in general. But I do think training on open source must be opt-in, and I look forward to a world with actually ethical and traceable models (i.e. with a record of what they were trained on, like a bill of materials (BOM)).