I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensional data. Mostly the results were garbage, but occasionally you hit on a real, unexpected phenomenon.
Nowadays you just throw all the data into a black box and believe whatever it says blindly.
Data hoarding predates LLMs. There were other machine learning methods that also needed data for training.
Were you not around for the Big Data heyday a decade ago?