I know for a fact that all SOTA models have Linux source code in them, intentionally or not, which means they should follow the GPL license terms and open-source the parts of the models that constitute derivative works of it.
Yes, this is indirectly hinting that during training the GPL-tainted code touches every single floating-point value in a model, making it a derivative work; even the tokenizer isn't immune to this (see the toy sketch below).
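To make the "touches every floating-point value" claim concrete, here's a minimal sketch in PyTorch. The model, the snippet, and the numbers are all hypothetical stand-ins, not any actual SOTA model or training pipeline; it just shows that one gradient step on a single training example produces nonzero updates across most of a model's parameters rather than a localized region.

```python
# Toy illustration (hypothetical tiny model): how diffusely one training
# example spreads through a model's parameters via a single gradient step.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a language model: byte embedding -> transformer block -> head.
model = nn.Sequential(
    nn.Embedding(256, 32),                                          # byte-level vocabulary
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    nn.Linear(32, 256),
)

# One "training example" -- imagine it is a snippet of GPL-licensed C code.
snippet = b"static int __init my_module_init(void) { return 0; }"
ids = torch.tensor([list(snippet)])                                 # shape (1, seq_len)

logits = model(ids)                                                 # forward pass
loss = nn.functional.cross_entropy(                                 # next-byte prediction loss
    logits[:, :-1].reshape(-1, 256), ids[:, 1:].reshape(-1)
)
loss.backward()

# Count how many parameters received a nonzero gradient from this one example.
touched = sum((p.grad != 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"{touched}/{total} parameters get a nonzero gradient from one snippet")

# Caveat: embedding rows for bytes absent from the snippet get a zero gradient,
# so "every single value" is an overstatement in the strict sense -- but the
# update is spread across essentially the whole network, not a copy stored
# in one identifiable place.
```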
When you say “in” them, are you referring to their training data, or their model weights, or the infrastructure required to run them?
> the tokenizer isn't immune to this
A tokenizer's set of tokens isn't copyrightable in the first place, so it can't really be a derivative work of anything.