> Language data is among the most rich and direct reflections of human cognitive processes that we have available.
This is both true and irrelevant. Written records can capture an enormous quantity of the human experience in absolute terms while simultaneously capturing a minuscule portion of the human experience in relative terms. Even if it's the best "that we have available", that doesn't mean it's fit for purpose. In other words, if you had a human infant and did nothing other than lock it in a windowless box and recite terabytes of text at it for 20 years, you would not expect to get a well-adjusted human on the other side.
Empirically, the capability gains from piping non-language data into pre-training are modest. At best.
I take that as a moderately strong signal against that "minuscule portion" notion. Clearly, raw text captures a lot.
If we're looking at biologicals, then "human infant" is a weird object, because it falls out of the womb pre-trained. Evolution is an optimization process - and it spent an awful lot of time running a highly parallel search over low-Kolmogorov-complexity priors to wire into mammal brains. Frontier labs can only wish they had the compute budget to do this kind of meta-learning.
Humans get a bag of computational primitives evolved for high fitness across a diverse range of environments - LLMs get the pit of vaguely constrained random initialization. No wonder they have to brute-force their way out of it with sheer amounts of data. Sample efficiency is low because we're paying the inverse problem tax on every sample.