krainboltgreene • today at 5:25 PM

> The current corpus used for training includes virtually all known material.

This is just totally incorrect. It's one of those things everyone just assumes, but there's an immense amount of known material that isn't even digitized, much less in the hands of tech companies.

Replies

drob518 • today at 5:30 PM

What large caches of undigitized content exist? Surely not everything has been digitized, but I can’t imagine it’s much in percentage terms.
