They know Google has a ton of data to train LLMs on.
Recently I have been asking YouTube's new AI about some videos ("when is Steam metrics mentioned in the video?" for example), which means they also index videos. This is an unthinkable amount of data.
I'm actually impressed at how bad Alphabet is with LLMs, since they invented the thing as we know it AND have all the data to train on, yet OpenAI and Anthropic are eating their pie.
I wouldn't be surprised if Google's logs alone are a substantial portion of all data created daily...
I've also asked the YouTube AI when certain things are mentioned in videos, and upon verification the AI was just hallucinating.
I don't think they 'index' videos, per se. They just point the model at the video's transcript on demand when you ask a question, I believe. Doesn't change any of your conclusions, though. You're absolutely right, they have an absolute ton of data.
Are you sure it’s not using transcripts? That would be equally useful but technologically less impressive.
Everyone mocked them for years for paying for YouTube while it brought in no real income. Now it’s the most valuable data source in the world.
Pretty sure it's only for videos with CC enabled.
Not only that, but the same webmasters who try to shoo AI crawlers away actively court Google's bots.
I use Anthropic's models daily, and sometimes switch to Gemini. Google is losing the marketing front BADLY, but their AI service is surprisingly great. It's far cheaper than Anthropic, for one, and for my kind of research it's just better.