bel8 • yesterday at 9:22 PM • 8 replies

They know Google has a ton of data to train LLMs on.

Recently I have been asking YouTube's new AI about some videos ("when is Steam metrics mentioned in the video?" for example), which means they also index videos. This is an unthinkable amount of data.

I'm actually impressed at how bad Alphabet is with LLMs, since they invented the thing as we know it AND have all the data to train on, yet OpenAI and Anthropic are eating their pie.

Replies

mitchell_h • yesterday at 9:28 PM

I use Anthropic's models daily and sometimes switch to Gemini. Google is losing the marketing front BADLY, but their AI service is surprisingly great. It's far cheaper than Anthropic, for one, and for my kind of research it's just better.

onlyrealcuzzo • today at 12:17 AM

I wouldn't be surprised if Google's logs alone are a substantial portion of all data created daily...

jonwachob91 • yesterday at 9:30 PM

I've also asked the YouTube AI when certain things are mentioned in videos, and on checking, the AI was just hallucinating.

tekacs • yesterday at 9:45 PM

I don't think they 'index' videos, per se. They just point the model at the video's transcript on demand when you ask a question, I believe. Doesn't change any of your conclusions, though. You're absolutely right, they have an absolute ton of data.

f0rgot • today at 12:34 AM

Are you sure it’s not using transcripts? That would be equally useful but technologically less impressive.

MagicMoonlight • yesterday at 9:28 PM

Everyone mocked them for paying for YouTube for years with no real income. Now it’s the most valuable data source in the world.

ecommerceguy • today at 12:19 AM

Pretty sure it's only for videos with CC enabled.

CamperBob2 • yesterday at 9:25 PM

Not only that, but the same webmasters who try to shoo AI crawlers away actively court Google's bots.
