> It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent it changes what your intention may be.
It sounds, then, like you're saying that scale does indeed matter in this context: every single piece of writing in existence isn't being slurped up purely to learn, it's being slurped up to make a profit.
Do you think they'd be able to offer a useful LLM if the model were trained only on what an average person could read in a lifetime?
It's common knowledge among LLM experts that the current capabilities of LLMs arise as emergent properties of training transformers on reams and reams of data.
That is the intent of scale: to push LLMs to this point of "emergence". Whether or not it's AGI is a debate I'm not willing to entertain, but pretty much everyone agrees there's a point where the scale flips a transformer from being an autocomplete machine into something more than that.
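As an aside on why that flip can look so sharp: here's a toy sketch (the numbers are made up, not taken from any real model) of how a smoothly improving per-token accuracy can still produce an abrupt jump on tasks scored as all-or-nothing, which is one common way "emergence" shows up in benchmarks:

```python
# Toy illustration: if per-token accuracy p improves smoothly with scale,
# the chance of getting an entire L-token answer exactly right is p**L.
# That quantity sits near zero for a long time and then shoots up --
# a sharp, "emergent"-looking jump from a smooth underlying skill curve.
L = 50  # assumed answer length in tokens (hypothetical)
for p in [0.80, 0.90, 0.95, 0.99, 0.999]:
    print(f"per-token accuracy {p:.3f} -> exact-match rate {p ** L:.5f}")
```

Running it, exact-match stays near zero (0.8^50 is about 0.00001) until per-token accuracy gets very high, then climbs steeply toward 0.95, which matches the felt experience of capabilities "switching on" at scale.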
That is the legal basis for why companies would go for scale with LLMs. It's the same reason people are allowed to own knives even though knives are known to be useful for murder (as a side effect).
So, technically speaking, these companies have legal runway in terms of intent. Making an emergent and helpful AI assistant is not illegal, and making a profit isn't illegal either.