How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.
[0] I think the data can be licensed, not just public domain; e.g. if the creators are suitably compensated for having their data ingested
> How many models are only trained on legal[0] data?
None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:
1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159