We need a new license that forbids all training. That is the only way to stop big corporations from doing this.
So if you put this hypothetical license on spam emails, then spam filters couldn't train to recognize them? I'm sure ad companies would LOVE that.
Fair use doesn’t need a license, so it doesn’t matter what you put in the license.
Generally speaking, licenses grant rights (they literally grant a license). They can't take rights away; only the legislature can do that.
Wouldn't it still be legal to train on the data under fair use?
By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.
How is that enforceable against the fly-by-night startups?
Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.
We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.
To my understanding, if the material is publicly available or was obtained legally (i.e., not pirated), then training a model on it falls under fair use, at least in the US and some other jurisdictions.
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if anyone ever took it to court.