I wouldn't be surprised if LLM companies end up sponsoring certain platforms / news sites, in exchange for being able to use their content, of course.
The problem with LLMs is that a single token (or even a single book) isn't really worth that much. It's not like human writing, where we'll pay far more for "Harry Potter" and "The Art of Computer Programming" than for some romance trash with three reads on Kindle.
LLM companies already do this. Both Reddit and Stack Overflow turned to shit (but much more profitable shit) when they sold their archives to the AI companies for lots of money.
This is perhaps true from the "language model" point of view, but surely from the "knowledge" point of view an LLM is prioritising a few "correct" data sources?
I wonder about this a lot when I ask LLMs niche technical questions. Often there is only one canonical source of truth. Surely it's somehow internally prioritising the official documentation? Or is it querying the documentation in the background and inserting it into the context window?