FromTheFirstIn • last Friday at 11:42 PM • 2 replies

And sitting right next to the data and compute factors in every cross-entropy loss equation is the entropy of the language, which is just a fixed constant. That puts a hard floor on what cross-entropy training can reach, and I never hear it come up!
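
To make the floor concrete, here is a minimal sketch of the standard identity H(p, q) = H(p) + KL(p || q): the cross entropy of a model q against data p is the entropy of p plus a non-negative gap. The three-token distribution and the model names below are made up for illustration and do not come from any real language model.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats: p is the data, q is the model."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats: the part of the loss a better model can remove."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distribution over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]

# Hypothetical models, from worst to perfect.
models = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "closer": [0.6, 0.25, 0.15],
    "exact": [0.7, 0.2, 0.1],
}

floor = entropy(p)  # the fixed constant: no model can get below this
for name, q in models.items():
    loss = cross_entropy(p, q)
    gap = kl_divergence(p, q)
    print(f"{name:8s} loss={loss:.4f}  floor={floor:.4f}  gap={gap:.4f}")
```

The loss only ever shrinks toward the floor; the gap column is the only part that training can change.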

Replies

aspenmartin • last Saturday at 1:03 AM

Right, but that is context dependent: it drops with context length, depends on the tokenizer, etc. It doesn't end up being super relevant, despite the fact that the loss for real models is relatively large in absolute terms. That doesn't really matter, because all of the interesting stuff happens once you start getting closer and closer to the floor. You've gotten past all of the easy tokens that dominate the entropy and now you get to the really challenging ones that we care about (e.g. very difficult reasoning about a next step).
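
A toy illustration of the context-length point, assuming a hypothetical two-symbol Markov source (the transition probabilities are invented): the floor for a predictor that sees no context is the entropy of the symbol frequencies, while the floor for a predictor that sees one previous symbol is the lower conditional entropy.

```python
import math

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in dist if x > 0)

# Hypothetical source: row i is P(next symbol | current symbol = i).
transition = [[0.9, 0.1],
              [0.2, 0.8]]

# Stationary distribution of a two-state chain, in closed form.
a, b = transition[0][1], transition[1][0]  # P(0 -> 1), P(1 -> 0)
stationary = [b / (a + b), a / (a + b)]

# Floor with no context: entropy of the marginal symbol frequencies.
floor_no_context = entropy(stationary)

# Floor with one symbol of context: average entropy of the next-symbol rows.
floor_one_context = sum(
    weight * entropy(row) for weight, row in zip(stationary, transition)
)

print(f"floor with no context:  {floor_no_context:.4f} nats")
print(f"floor with one symbol:  {floor_one_context:.4f} nats")
```

For this first-order source, context beyond one symbol would not lower the floor any further; real text has much longer dependencies, which is why the floor keeps moving with context length.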

317070 • today at 8:23 AM

Right, and what happens at that limit is most exciting! A model whose cross entropy reaches that limit for a data stream of text produces a stream of text that is both theoretically and practically indistinguishable from the original stream.

And so if the data stream was produced by something intelligent, the resulting model is indistinguishable from that intelligence. That is the whole compression idea behind artificial intelligence.

The limit is not a bug, it's a feature!
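
A rough sketch of why reaching the limit implies indistinguishability, for a single next-token draw and with made-up distributions. It uses Pinsker's inequality, TV(p, q) <= sqrt(KL(p || q) / 2), together with the fact that the best possible guesser telling one sample of p from one sample of q is right with probability 1/2 + TV/2. As the gap above the entropy floor goes to zero, so does any distinguisher's edge. Over a long stream the per-token gaps add up, so this only shows the direction of the argument.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats: the model's loss above the entropy floor."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical data distribution and three hypothetical models.
p = [0.7, 0.2, 0.1]
models = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "closer": [0.6, 0.25, 0.15],
    "exact": [0.7, 0.2, 0.1],
}

results = {}
for name, q in models.items():
    gap = kl_divergence(p, q)            # loss above the floor
    tv = total_variation(p, q)           # how different the samples look
    pinsker_bound = math.sqrt(gap / 2)   # upper bound on tv
    best_guess = 0.5 + tv / 2            # optimal distinguisher accuracy
    results[name] = (gap, tv, pinsker_bound, best_guess)
    print(f"{name:8s} gap={gap:.4f}  tv={tv:.4f}  "
          f"bound={pinsker_bound:.4f}  best_guess={best_guess:.4f}")
```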
