
ACCount37 | yesterday at 9:37 PM

Empirically, the capability gains from piping non-language data into pre-training are modest. At best.

I take that as a moderately strong signal against the "minuscule portion" notion. Clearly, raw text captures a lot.

If we're looking at biologicals, then "human infant" is a weird object, because it falls out of the womb pre-trained. Evolution is an optimization process - and it spent an awful lot of time running a highly parallel search over low-K-complexity priors to wire into mammal brains. Frontier labs can only wish they had the compute budget to do this kind of meta-learning.

Humans get a bag of computational primitives evolved for high fitness across a diverse range of environments - LLMs get the pit of vaguely constrained random initialization. No wonder they have to brute-force their way out of it with sheer volume of data. Sample efficiency is low because we're paying the inverse-problem tax on every sample.
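
To make the "evolution as meta-learning" point concrete, here's a toy MAML-style sketch of my own (nothing from the comment itself; names and constants like sample_task, INNER_STEPS and TASK_MEAN are made up): an outer loop spends lots of compute tuning an initialization across a family of tasks, so that a tiny fixed inner-loop budget adapts well - while a random init pays full price on every new task.

    import numpy as np

    rng = np.random.default_rng(0)

    TASK_MEAN, TASK_SPREAD = 3.0, 0.5   # structure in the task family for the outer loop to exploit
    INNER_STEPS, INNER_LR = 3, 0.3      # the "infant" only gets a handful of updates

    def sample_task():
        """A task: 1-D quadratic loss whose optimum theta ~ N(TASK_MEAN, TASK_SPREAD)."""
        theta = rng.normal(TASK_MEAN, TASK_SPREAD)
        return (lambda w: 0.5 * (w - theta) ** 2), (lambda w: w - theta)

    def adapt(w0, task):
        """Inner loop: a few gradient steps from init w0; return post-adaptation loss."""
        loss, grad = task
        w = w0
        for _ in range(INNER_STEPS):
            w -= INNER_LR * grad(w)
        return loss(w)

    def meta_train(outer_steps=2000, outer_lr=0.05):
        """Outer loop ("evolution"): nudge the init toward whatever makes adaptation cheap."""
        w0 = rng.normal(0.0, 5.0)
        for _ in range(outer_steps):
            task = sample_task()
            eps = 1e-3  # finite-difference estimate of d(post-adaptation loss)/d(init)
            g = (adapt(w0 + eps, task) - adapt(w0 - eps, task)) / (2 * eps)
            w0 -= outer_lr * g
        return w0

    def avg_loss(w0, trials=1000):
        return float(np.mean([adapt(w0, sample_task()) for _ in range(trials)]))

    random_init = rng.normal(0.0, 5.0)
    evolved_init = meta_train()
    print("random init :", avg_loss(random_init))
    print("evolved init:", avg_loss(evolved_init))
    # Same inner-loop budget, but the meta-learned init starts near the task
    # family's structure, so it needs far fewer steps/samples to do well.

The asymmetry in the printout is the whole argument in miniature: the expensive outer search bakes the task family's regularities into the starting point, and the cheap inner learner inherits that sample efficiency for free.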