> How many models are only trained on legal[0] data?
None, since 'legal' in the context of AI training data is not yet well defined, but OLMo is trained on the Dolma dataset, which consists of:
1. Common Crawl
2. Github
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
Nice, I hadn't heard of this. For convenience, here are the Dolma dataset on HuggingFace and the models trained on it:
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma
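If you'd rather query that second list programmatically, here's a minimal sketch using the `huggingface_hub` client; the `dataset:allenai/dolma` filter string just mirrors the `?dataset=` parameter in the URL above:

```python
# Minimal sketch: list HF models tagged as trained on Dolma.
from huggingface_hub import HfApi

api = HfApi()
# "dataset:allenai/dolma" mirrors the ?dataset= filter in the URL above.
for model in api.list_models(filter="dataset:allenai/dolma"):
    print(model.id)
```

Note this only catches models whose authors tagged the dataset in their model card, so treat the result as a lower bound.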