Reasonably speaking, there is no way they can know how they plan to invest $500 billion dollars. The current generation of large language models basically use all human text thats ever been created for the parameters... not really sure where you go after than using the same tech.
The latest hype is around "agents", everyone will have agents to do things for them. The agents will incidentally collect real-time data on everything everyone uses them for. Presto! Tons of new training data. You are the product.
The new scaling vector is “test time compute” ie spending more compute in inference.
It seems to me you could generate a lot of fresh information from running every youtube video, every hour of TV on archive.org, every movie on the pirate bay -- do scene by scene image captioning + high quality whisper transcriptions (not whatever junk auto-transcription YouTube has applied), and use that to produce screenplays of everything anyone has ever seen.
I'm not sure why I've never heard of this being done, it would be a good use of GPUs in between training runs.
I think there is huge amount of corporate knowledge.
That's not really true - the current generation, as in "of the last three months", uses reinforcement learning to synthesize new training data for themselves: https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero