> How many models are only trained on legal[0] data?
None, since 'legal' in the context of AI training data is not yet well defined, but OLMo is trained on the Dolma dataset, which consists of:
1. Common Crawl
2. Github
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
Nice, I hadn't heard of this. For convenience, here are the Dolma dataset on HuggingFace and the models trained on it:
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma
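If you'd rather query that second list programmatically, here's a minimal sketch using the `huggingface_hub` client; the `dataset:allenai/dolma` filter string just mirrors the `?dataset=` parameter in the URL above:

```python
# Minimal sketch: list HF models tagged as trained on Dolma.
from huggingface_hub import HfApi

api = HfApi()
# "dataset:allenai/dolma" mirrors the ?dataset= filter in the URL above.
for model in api.list_models(filter="dataset:allenai/dolma"):
    print(model.id)
```

Note this only catches models whose authors tagged the dataset in their model card, so treat the result as a lower bound.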