
rogerrogerr, yesterday at 3:24 PM

They’ll never reveal the data, because that would reveal this is all built on stolen work.


Replies

simonw, yesterday at 3:40 PM

Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:

https://huggingface.co/allenai/OLMo-2-0325-32B

Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.