We need LLMs that have a certificate of origin.
For instance, a GPL LLM trained only on GPL code, where the source data is all known and the output is all GPL.
It could be done with a distributed effort.
I don't think licensing is the main problem; the spam is.
Honestly, given that such a GPL model would be far below SOTA in capabilities, what exactly would its use case be? Why would anyone use an inferior LLM if they can get away with using a superior one?
Rather, LLMs that do NOT contain GPL code.
Not necessarily a bad idea, but I think the bigger issue here and now is the growing asymmetry in effort between code submitter and reviewer, and the unsustainable review burden on maintainers if nothing is done.