> So it needs to know facts, albeit the currently accepted ones. Knowing the facts is a good way to compress data.
It's not a compression engine - it's just a statistical predictor.
Would it do better if it was incentivized to compress (i.e., if the training loss rewarded compression as well as penalizing next-word errors)? I doubt it would make much difference: presumably it'd end up throwing away the less frequently occurring "outlier" data in favor of keeping what was more common, which would mean discarding the rare expert opinion while retaining the incorrect vox pop.
Both compression engines and LLMs work by assigning probabilities to the next token. If you can guess the probability distribution of the next token, you have a near-perfect text compressor and a near-perfect LLM. Yeah, in the real world they have different trade-offs.
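To make that equivalence concrete, here's a toy sketch (my own illustration, not from the paper): feed a predictor's next-symbol distribution into an arithmetic coder, and better predictions translate directly into fewer bits. The two-symbol `next_token_probs` model below is a hypothetical stand-in for an LLM's softmax output.

```python
import math

def next_token_probs(context: str) -> dict[str, float]:
    """Stand-in predictor over a two-symbol alphabet. In a real system
    this would be an LLM's next-token distribution conditioned on
    `context`. This toy model just bets on repetition."""
    if context and context[-1] == "a":
        return {"a": 0.8, "b": 0.2}
    if context and context[-1] == "b":
        return {"a": 0.2, "b": 0.8}
    return {"a": 0.5, "b": 0.5}

def encode(text: str) -> tuple[float, float]:
    """Arithmetic coding: shrink an interval in [0, 1) by each symbol's
    predicted probability mass. Naming a point in the final interval
    takes about -log2(width) bits, so sharper predictions mean a wider
    final interval and fewer bits."""
    low, high = 0.0, 1.0
    for i, ch in enumerate(text):
        probs = next_token_probs(text[:i])
        cum = 0.0
        for sym, p in sorted(probs.items()):
            if sym == ch:
                width = high - low
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
    return low, high

for msg in ("aaaabbbb", "abababab"):
    low, high = encode(msg)
    print(f"{msg}: interval width {high - low:.3g}, "
          f"~{math.ceil(-math.log2(high - low))} bits")
```

Running it, the string the model predicts well ("aaaabbbb") encodes in ~6 bits, while the one it predicts badly ("abababab") needs ~18: exactly the prediction-is-compression link the paper formalizes.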
Here's a paper by DeepMind titled "Language Modeling Is Compression": https://arxiv.org/pdf/2309.10668