It seems to me you could generate a lot of fresh information from running every YouTube video, every hour of TV on archive.org, and every movie on The Pirate Bay -- do scene-by-scene image captioning plus high-quality Whisper transcriptions (not whatever junk auto-transcription YouTube has applied) -- and use that to produce screenplays of everything anyone has ever seen.
I'm not sure why I've never heard of this being done; it would be a good use of GPUs in between training runs.
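Roughly, the per-video step could look something like this -- a minimal sketch, assuming ffmpeg, the openai-whisper package, and a BLIP captioning model from Hugging Face, with fixed-interval frame sampling standing in for real scene detection (something like PySceneDetect would do that properly):

```python
# Sketch of the pipeline above: sample frames, caption them, transcribe the
# audio with Whisper, and interleave the two into a rough "screenplay".
# Model choices and the input filename are illustrative, not prescriptive.
import glob
import os
import subprocess

import whisper                      # pip install openai-whisper
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

VIDEO = "input.mp4"                 # hypothetical input file
FRAME_DIR = "frames"

# 1. Sample one frame every 10 seconds as a crude stand-in for scene detection.
os.makedirs(FRAME_DIR, exist_ok=True)
subprocess.run(["ffmpeg", "-i", VIDEO, "-vf", "fps=1/10",
                os.path.join(FRAME_DIR, "%06d.jpg")], check=True)

# 2. Caption each sampled frame with an image-captioning model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large")
captions = []                       # (approx. timestamp in seconds, caption)
for i, path in enumerate(sorted(glob.glob(os.path.join(FRAME_DIR, "*.jpg")))):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    captions.append((i * 10, processor.decode(out[0], skip_special_tokens=True)))

# 3. Transcribe the audio with Whisper (much better than auto-captions).
asr = whisper.load_model("large-v3")
segments = asr.transcribe(VIDEO)["segments"]   # each has start, end, text

# 4. Interleave visual descriptions and dialogue into screenplay-style text.
events = [(t, f"[SCENE] {c}") for t, c in captions] + \
         [(s["start"], f"DIALOGUE: {s['text'].strip()}") for s in segments]
for t, line in sorted(events):
    print(f"{t:8.1f}s  {line}")
```

Each video is independent, so the whole thing is an embarrassingly parallel batch job -- which is exactly why idle GPU time between training runs would suit it.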
Don't forget every hour of news broadcasting, which we likely won't run out of any time soon. Plus high-quality radio.
I think this is the obvious path to more robust models -- grounding language in video.
> a lot of fresh information from running every YouTube video
EVERY YouTube video?? Even the 9/11 truther videos? The Sandy Hook conspiracy videos? Flat earth? Even the blatantly racist ones? This would be some bad training data without some pruning.
The fact that OpenAI can just scrape all of YouTube, and Google isn't even taking legal action or attempting to stop it, is wild to me. Is Google just asleep?