Hacker News

hackinthebochs (last Tuesday at 7:36 PM)

Tokens are the most basic input unit of an LLM. But tokens don't generally correspond to whole words; they're sub-word sequences. So 'strawberry' might be broken into two tokens, 'straw' and 'berry'. An LLM has trouble distinguishing "sub-token" features like specific letter sequences because it never sees letter sequences, just each token as a single atomic unit. A system's basic input units define how one input state is distinguished from another, and to recognize identity between input states, those states must be identical. It's a bit unintuitive, but identity between an individual letter and the letters within a token fails because of the specifics of tokenization. 'straw' and 'r' are two different tokens, and an LLM is entirely blind to the fact that 'straw' contains an 'r'. Tokens are the basic units of distinction: 'straw' is not represented as the sequence s-t-r-a-w but as its own atomic thing, so the two tokens are not considered equal or even partially equal.
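To see the split concretely, here's a quick sketch using OpenAI's tiktoken library (assuming it's installed via pip install tiktoken; the exact split depends on the vocabulary, so treat the output as illustrative rather than canonical):

    import tiktoken

    # cl100k_base is the BPE vocabulary used by several OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["strawberry", "straw", "r"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{word!r} -> token ids {ids} -> pieces {pieces}")

    # The model only ever receives the integer ids. Nothing in id space
    # indicates that the token for 'straw' contains the letter 'r'.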

As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. Of course you can't do this; you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.

Actually, I asked Grok about this a while ago when probing how well it could count the vowels in a word. It got the count right by listing every letter individually. I then asked it to count without listing the letters, and its count was off by a couple. When I asked how it was counting without listing the letters, its answer was pretty fascinating, with a seeming awareness of its own internal processes:

Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.

When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
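To illustrate the difference between the two strategies it describes, here's a toy model (purely illustrative, not a claim about Grok's internals; the "remembered" per-token vowel counts below are made up):

    VOWELS = set("aeiou")

    def count_by_letters(word: str) -> int:
        # Precise and sequential: step through every character.
        return sum(ch in VOWELS for ch in word.lower())

    # Hypothetical token split with remembered per-token vowel counts;
    # the entry for 'fornia' is deliberately off by one to mimic a
    # fuzzy internalized association.
    token_vowel_memory = {"cali": 2, "fornia": 2}

    def count_by_tokens(tokens: list[str]) -> int:
        # Holistic: sum remembered counts without inspecting letters.
        return sum(token_vowel_memory[t] for t in tokens)

    print(count_by_letters("california"))       # 5
    print(count_by_tokens(["cali", "fornia"]))  # 4 -- close, but off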


Replies

svachalek (last Tuesday at 7:59 PM)

That explanation is pretty freaky, as it implies a form of consciousness I don't believe LLMs have. I've never seen this explanation before, so I'm not sure it's from training, and yet it's probably a fairly accurate description of what's going on.
