Hacker News

The bitter lesson is coming for tokenization

268 points by todsacerdoti yesterday at 2:14 PM | 118 comments

Comments

smeeth yesterday at 5:15 PM

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
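
A minimal probe in that direction might look at how consistently a tokenizer segments numbers (a sketch using the open `tiktoken` library and its `cl100k_base` vocabulary; any BPE tokenizer would do):

```python
# Hypothetical probe: how consistently does a BPE tokenizer segment numbers?
# Misaligned digit groupings are one suspected cause of poor LLM arithmetic.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an open BPE vocabulary

for n in [7, 42, 1234, 12345, 999999, 1000000]:
    s = str(n)
    pieces = [enc.decode([t]) for t in enc.encode(s)]
    # A "clean" split would chunk digits in a fixed pattern (e.g. groups of 3
    # aligned to place value); in practice the grouping shifts with the
    # number's length, so the same digit plays different roles across tokens.
    print(f"{s!r:>10} -> {pieces}")
```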

Scene_Cast2 yesterday at 3:24 PM

I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies a maximum of 1k degrees of freedom (or rank) on our output. The model is able to pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
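
A toy sketch of that rank limit (sizes scaled down 10x so the rank computation runs quickly; `numpy` only):

```python
# Toy illustration of the softmax bottleneck: logits over V tokens are a
# linear function of a d-dimensional hidden state, so logits collected from
# many different contexts can never exceed rank d, no matter how big V is.
import numpy as np

V, d, n_contexts = 1_500, 100, 500   # scaled-down stand-ins for 15k / 1k
rng = np.random.default_rng(0)

W = rng.standard_normal((V, d))           # output (unembedding) matrix
H = rng.standard_normal((d, n_contexts))  # hidden states from many contexts

logits = W @ H                            # shape (V, n_contexts)
print(np.linalg.matrix_rank(logits))      # prints 100 (= d), not 500 or 1500
```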

rryan today at 5:38 AM

Don't make me tap the sign: There is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people mean when they talk about modeling the "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of the Unicode codepoints.
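
Concretely (plain Python): the same number of codepoints costs a very different number of UTF-8 bytes depending on the script, which is a design choice of the encoding, not a property of the text:

```python
# UTF-8 is itself a variable-length code over codepoints: ASCII gets 1 byte,
# most European scripts 2, CJK 3, emoji 4 -- a human-chosen bias, not a law.
for s in ["hello", "héllo", "こんにちは", "👋👋👋👋👋"]:
    print(f"{s!r}: {len(s)} codepoints -> {len(s.encode('utf-8'))} bytes")
```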

qoez yesterday at 4:50 PM

The counterargument is that the theoretical minimum is a few McDonald's meals a day's worth of energy, even for the highest-ranked human pure mathematician.

marcosdumay yesterday at 4:31 PM

Yeah, make the network deeper.

When all you have is a hammer... It makes a lot of sense that a transformation layer that makes the tokens more semantically relevant will help optimize the entire network after it and increase the effective size of your context window. And one of the main immediate obstacles stopping those models from being intelligent is context window size.

On the other hand, the current models already cost something on the order of a median country's GDP to train, and they are nowhere close to that in value. The saying that "if brute force didn't solve your problem, you didn't apply enough force" is meant to be heard as a joke.

pona-a yesterday at 5:48 PM

Didn't tokenization already have one bitter lesson: that it's better to let simple statistics guide the splitting, rather than expert morphology models? Would this technically be a more bitter lesson?

cheesecompiler yesterday at 3:32 PM

The reverse is possible too: throwing massive compute at a problem can mask the existence of a simpler, more general solution. General-purpose methods tend to win out over time—but how can we be sure they’re truly the most general if we commit so hard to one paradigm (e.g. LLMs) that we stop exploring the underlying structure?

broses today at 1:21 AM

This gave me an idea: we can take a mixture of tokenizations with learned weights, just like taking a mixture of experts with learned weights. BLT is optimized for compression, but an approach like this could be optimized directly for model performance, and really learn to skim.

Concretely: we learn a medium sized model that takes a partial tokenization and outputs a probability distribution over the endpoints of the next token (say we let the token lengths range from 1 to 64 bytes, the model outputs 64 logits). Then we do a beam search to find the, say, 4 most likely tokenizations. Then we run the transformer on all four tokenizations, and we take the expected value of the loss to be the final loss.

If we train this on prompt-response pairs, so that it only has to learn what to say and doesn't have to predict the context, then it could learn to skim boring stuff by patching it into ~64 byte tokens. Or more if we want.

And ofc we'd use a short-context byte-level transformer to encode/decode tokens to vectors. Idk, this idea is kinda half-baked.
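
Roughly, the search could look like this (a sketch only; `endpoint_logprobs` stands in for the medium-sized segmenter and `sequence_loss` for a run of the main transformer over one candidate segmentation):

```python
# Half-baked sketch of the idea above: beam-search the top-k segmentations
# of a byte string, score each with the main model, and average the losses.
import heapq
import math

MAX_TOKEN_LEN = 64

def beam_search_segmentations(data: bytes, endpoint_logprobs, beam_width=4):
    """Return the `beam_width` most likely segmentations of `data`.

    endpoint_logprobs(data, pos) -> list of MAX_TOKEN_LEN log-probs, where
    entry i is the log-prob that the token starting at `pos` has length i+1.
    """
    # Each beam item: (cumulative negative log-prob, position, token end offsets)
    beams = [(0.0, 0, [])]
    while any(pos < len(data) for _, pos, _ in beams):
        candidates = []
        for nlp, pos, ends in beams:
            if pos >= len(data):
                candidates.append((nlp, pos, ends))  # segmentation complete
                continue
            for i, lp in enumerate(endpoint_logprobs(data, pos)):
                end = pos + i + 1
                if end <= len(data):
                    candidates.append((nlp - lp, end, ends + [end]))
        beams = heapq.nsmallest(beam_width, candidates, key=lambda b: b[0])
    return beams

def expected_loss(data: bytes, beams, sequence_loss):
    """Expected main-model loss over the surviving segmentations,
    weighted by their renormalized probabilities under the segmenter."""
    weights = [math.exp(-nlp) for nlp, _, _ in beams]
    total = sum(weights)
    return sum((w / total) * sequence_loss(data, ends)
               for w, (_, _, ends) in zip(weights, beams))
```

Weighting each surviving segmentation by its renormalized probability is what makes the final number an expected value of the loss rather than a plain average.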

resters yesterday at 8:02 PM

Tokenization as a form of preprocessing has the problems the authors mention. But it is also a useful way to think about data vs. metadata, and about moving beyond text/image I/O into other domains. Ultimately we need symbolic representations of things. Sure, they are all ultimately bytes that the model could learn to self-organize, but symbolic representations are useful when humans interact with the data directly: in a sense, tokens make more aspects of LLM internals "human readable". And models should also be able to learn to overcome the limitations of a particular tokenization scheme.

andy99 yesterday at 5:11 PM

> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's in strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each was one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, and that got picked up?
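
One way to see the connection (using the open `tiktoken` BPE as an example; the exact split depends on the tokenizer):

```python
# The model never receives characters, only token ids; whether "counting r's"
# is easy depends on how the letters happen to be grouped into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids, [enc.decode([i]) for i in ids])
# e.g. something like ['str', 'aw', 'berry']: the three r's are spread
# across opaque ids, so the letter count is never directly visible.
```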

fooker yesterday at 10:58 PM

‘Bytes’ is tokenization.

There’s no reason to assume it’s the best solution. It might be the case that a better tokenization scheme is needed for math, reasoning, video, etc. models.

blixt yesterday at 7:49 PM

I’m starting to think “The Bitter Lesson” is a clever-sounding way to throw shade at people who failed to nail it on their first attempt. Usually engineers build much more technology than they actually end up needing, then the extras get shed with time and experience (and often you end up building it again from scratch). It’s not clear to me that starting with “just build something that scales with compute” would get you closer to the perfect solution, even if, as you get closer to it, you do indeed make it possible to throw more compute at it.

That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.

kgeist yesterday at 11:10 PM

>From a domain point of view, some are skeptical that bytes are adequate for modelling natural language

If I remember correctly, GPT-3.5's tokenizer treated Cyrillic as individual characters, and GPT-3.5 was pretty good at Russian.

perching_aix yesterday at 7:51 PM

Can't wait for models to struggle with adhering to UTF-8.

ofou today at 7:03 AM

This is stupid because UTF-8 is a tokenizer that covers all of Unicode with a vocab of only 256 (yes, without a K). It’s the only way of scaling the bitter lesson with tokenizers. Also, with architectures that span 1M+ token context windows, the reduced effective context is no longer an argument/issue.
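
For reference, a byte-level "tokenizer" really is this small (a toy sketch; real byte-level models typically add a few special ids on top of the 256):

```python
# Toy byte-level tokenizer: a fixed vocab of 256 covers every Unicode string,
# at the cost of longer sequences (cf. the context-window tradeoff above).
class ByteTokenizer:
    vocab_size = 256  # no merges, no training, no out-of-vocabulary tokens

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok = ByteTokenizer()
ids = tok.encode("naïve 数学 🙂")
assert tok.decode(ids) == "naïve 数学 🙂"
print(len(ids), "ids from a 10-character string")  # sequences get longer
```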

hoseja today at 7:29 AM

There are no r's in strawberry, there are two ɹ's and several dozen achenes. It's a member of Rosaceae by any other name. Fighting with nonsensical english orthography seems kinda pointless to me. Stop trying to make intelligent entities composed of written text.

citizenpaul yesterday at 6:14 PM

The best general argument I've heard against the bitter lesson: if the bitter lesson is true, how come we spend so many millions of man-hours a year tweaking and optimizing software systems? Surely it's easier and cheaper to just buy a rack of servers.

Maybe if you have infinite compute you don't worry about software design. Meanwhile in the real world...

Not only that, but where did all these compute-optimized solutions come from? Oh yeah: millions of man-hours of optimizing and testing algorithmic solutions. So unless you are some head-in-the-clouds tenured professor, just keep on doing your optimizations and job as usual.
