Hacker News

Freak_NL today at 1:13 PM (2 replies)

How do you think those models get trained? You can only get so far with Wikipedia, Reddit, and non-fiction works like books and academic papers.


Replies

tossandthrow today at 1:30 PM

Have a look at this article: https://www.washingtonpost.com/technology/interactive/2023/a...

The NY Times is 0.06% of Common Crawl.

These news media outlets provide a drop in the ocean's worth of information, both quantitatively and qualitatively.

The news/media industry is really just clinging to its lifeboat before inevitably becoming entirely irrelevant.

(I do find this sad, but it is the reality: I can already get considerably better journalism from LLMs than from actual journalists, for both the clickbait stuff and the high-quality stuff.)

RugnirViking today at 1:17 PM

How does the entire textual corpus of, say, the New York Times compare to all novels? Each article is a page of text, maybe two at most. There are certainly an awful lot of articles, but it's hard to imagine the total amounts to much more than a couple hundred novels' worth, and there must be thousands of novels released each year.
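That comparison can be sketched with back-of-the-envelope arithmetic. Every figure below is a guessed ballpark (words per article, articles per day, novel length, novels per year), not sourced data; the point is only the order-of-magnitude contrast per year of output:

```python
# Rough per-year comparison: one newspaper's text output vs. published novels.
# All figures are assumed ballparks, not sourced data.

words_per_article = 1_000     # "a page of text, maybe two"
articles_per_day = 250        # guess for a large daily paper
words_per_novel = 90_000      # typical novel length
novels_per_year = 10_000      # conservative guess for novels published yearly

paper_words_per_year = words_per_article * articles_per_day * 365
novel_words_per_year = words_per_novel * novels_per_year

print(f"newspaper: ~{paper_words_per_year / 1e6:.0f}M words/year "
      f"(~{paper_words_per_year / words_per_novel:,.0f} novel-equivalents)")
print(f"novels:    ~{novel_words_per_year / 1e6:.0f}M words/year")
```

Under these assumptions a large daily produces on the order of a thousand novel-equivalents of text per year, while a year's novels amount to roughly ten times that, though the ratio swings with whichever guesses you plug in.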
