Hacker News

bel8 yesterday at 9:22 PM

They know Google has a ton of data to train LLMs on.

Recently I have been asking YouTube's new AI about some videos ("when is Steam metrics mentioned in the video?" for example), which means they also index videos. This is an unthinkable amount of data.

I'm actually impressed at how bad Alphabet is with LLMs, since they invented the thing as we know it AND have all the data to train on, yet OpenAI and Anthropic are eating their pie.


Replies

mitchell_h yesterday at 9:28 PM

I use Anthropic's models daily, and sometimes switch to Gemini. Google is losing the marketing front BADLY, but their AI service is surprisingly great. It's far cheaper than Anthropic, for one, and for my kind of research it's just better.

onlyrealcuzzo today at 12:17 AM

I wouldn't be surprised if Google's logs alone are a substantial portion of all data created daily...

jonwachob91 yesterday at 9:30 PM

I've also asked the YouTube AI when certain things are mentioned in videos, and when I checked, the AI was just hallucinating.

tekacs yesterday at 9:45 PM

I don't think they 'index' videos, per se. They just point the model at the video's transcript on demand when you ask a question, I believe. Doesn't change any of your conclusions, though. You're absolutely right, they have an absolute ton of data.

f0rgot today at 12:34 AM

Are you sure it’s not using transcripts? That would be equally useful but technologically less impressive.

MagicMoonlight yesterday at 9:28 PM

Everyone mocked them for paying for YouTube for years with no real income. Now it’s the most valuable data source in the world.

ecommerceguy today at 12:19 AM

Pretty sure it's only for videos with CC enabled.

CamperBob2 yesterday at 9:25 PM

Not only that, but the same webmasters who try to shoo AI crawlers away actively court Google's bots.
