Hacker News

andy99 · yesterday at 5:11 PM · 6 replies

> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the "r's in strawberry" thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, and it got picked up?


Replies

ijk · yesterday at 5:15 PM

Well, which is easier:

Count the number of Rs in this sequence: [496, 675, 15717]

Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
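
A minimal sketch of what the model actually receives, assuming the tiktoken package and its cl100k_base encoding (exact splits and IDs vary by tokenizer):

    # What the model sees: opaque token IDs, not characters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    word = "strawberry"
    ids = enc.encode(word)
    print(ids)                              # a short list of integer IDs
    print([enc.decode([i]) for i in ids])   # the sub-word chunks behind those IDs

    # "How many r's?" is being asked about the ID sequence above,
    # not about the characters s-t-r-a-w-b-e-r-r-y.
    print(word.count("r"))                  # 3, trivial at the character level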

krackers · yesterday at 6:56 PM

Until I see evidence that an LLM trained at, e.g., the character level _CAN_ successfully "count Rs", I don't trust this explanation over any other hypothesis. I'm not familiar with the literature, so I don't know if this has been done, but I couldn't find anything with a quick search. Surely if someone had successfully done it, they would have published it.

meroes · yesterday at 7:35 PM

I don't buy the token explanation because RLHF work is/was filled with so many "count the number of ___" prompts. There's just no way AI companies pay so much $$$ for RLHF of these prompts when the error is purely in tokenization.

IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization would solve the problem. Meanwhile, RLHF'ers were/are being paid en masse to solve it by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fix would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavily syntax-based problems).

No, the reason models are better at these problems now is RLHF. And before you say, well, now models have learned how to count in general, I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but they can still be useful.

zachooz · yesterday at 6:13 PM

A sequence of characters is grouped into a "token." The set of all such possible sequences forms a vocabulary. Without loss of generality, consider the example: strawberry -> straw | ber | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw input to the model is not a sequence of characters, but a sequence of token embeddings each representing a learned vector for a specific chunk of characters. These embeddings contain no explicit information about the individual characters within the token. As a result, if the model needs to reason about characters, for example, to count the number of letters in a word, it must memorize the character composition of each token. Given that large models like GPT-4 use vocabularies with 100k–200k tokens, it's not surprising that the model hasn't memorized the full character breakdown of every token. I can't imagine that many "character level" questions exist in the training data.

In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
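
A toy sketch of that contrast, reusing the illustrative IDs from above (a hypothetical vocabulary, not any real model's):

    word = "strawberry"

    # Sub-word input: a few opaque IDs, one per chunk; the letter 'r'
    # is not a symbol anywhere in this sequence.
    subword_ids = [3940, 3231, 1029]   # straw | ber | ry (illustrative IDs)

    # Character-level input: one ID per character, so 'r' is a recurring,
    # directly visible symbol the model could learn to count.
    char_vocab = {c: i for i, c in enumerate(sorted(set(word)))}
    char_ids = [char_vocab[c] for c in word]

    r_id = char_vocab["r"]
    print(sum(1 for i in char_ids if i == r_id))   # 3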

I'm not sure what you mean by them not "seeing" the tokens; they definitely receive a representation of each token as input.

skerit · today at 8:22 AM

LLMs aren't necessarily taught which characters their tokens represent. It's kind of like how some humans can speak a language but not write it. We are basically "transcribing" what LLMs are saying into text.

hackinthebochs · yesterday at 7:36 PM

Tokens are the most basic input unit of an LLM. But tokens don't generally correspond to whole words, rather to sub-word sequences. So "strawberry" might be broken up into two tokens, 'straw' and 'berry'. The model has trouble distinguishing "sub-token" features, like specific letter sequences, because it doesn't see letter sequences at all, just the token as a single atomic unit. The basic input unit of a system determines how one input state is distinguished from another, and to recognize identity between input states, those states must be identical. It's a bit unintuitive, but identity between an individual letter and the letters within a token fails because of the specifics of tokenization. 'straw' and 'r' are two tokens, but an LLM is entirely blind to the fact that 'straw' has one 'r' in it. Tokens are the basic units of distinction; 'straw' is not represented as a sequence of s-t-r-a-w tokens but is its own thing entirely, so the two are not considered equal or even partially equal.
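
A tiny sketch of that "atomic unit" point, again assuming the tiktoken package (the specific IDs don't matter):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    straw = enc.encode("straw")
    r = enc.encode("r")
    print(straw, r)     # two unrelated lists of integer IDs; nothing in the
                        # representation says that 'straw' contains an 'r'
    print(straw == r)   # False; identity between tokens is all-or-nothing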

As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.

Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:

> Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.

> When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
