Hacker News

danielmarkbruce 01/21/2025

The LLM can't; that's what makes it relatively difficult. The tokenizer can.

Run it through your head with character-level tokenization. Imagine the attention calculations. See how easy it would be? See how few samples would be required? It's a trivial thing when the tokenizer breaks everything down to characters.

Consider the amount and specificity of training data required to learn spelling 'games' with current tokenization schemes: vocabularies of 100,000-plus tokens, many of which sit close together in high-dimensional space but are spelled very differently. Then consider the various datasets that present phonetic information as a way to spell. They'd be tokenized in ways that confuse a model.
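
To make the contrast concrete, here's a toy, self-contained sketch (the vocabulary and the greedy matching are made up for illustration, not any real tokenizer). With character-level tokenization the letters are the input; with subword tokenization whole chunks collapse into opaque pieces:

    # Character-level: every letter is its own token, so "spell this word"
    # is just copying the input tokens back out.
    def char_tokenize(word):
        return list(word)

    # Subword-style: whole chunks collapse to single pieces, so the letters
    # inside "straw" and "berry" are invisible unless the model has
    # memorized each piece's spelling from training data.
    TOY_VOCAB = {"straw", "berry", "st", "raw"}

    def subword_tokenize(word):
        # Greedy longest-match over the toy vocabulary, for illustration only.
        tokens, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in TOY_VOCAB:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # unknown: fall back to one character
                i += 1
        return tokens

    print(char_tokenize("strawberry"))     # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
    print(subword_tokenize("strawberry"))  # ['straw', 'berry']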

Look, maybe go build one. Your head will spin once you start dealing with the various types of training data and how different tokenization changes things. It screws up spelling, math, code, technical biology material, and financial material. I specifically build models for financial markets and it's an issue.


Replies

HarHarVeryFunny 01/21/2025

> I specifically build models for financial markets and it's an issue.

Well, as you can verify for yourself, LLMs can spell just fine, even if you choose to believe that they are doing so by black magic or tool use rather than learnt prediction.

So whatever problems you are having with your financial models, they aren't because the models can't spell.

HarHarVeryFunny 01/21/2025

You seem to think that predicting "s t" -> "s t" is easier than predicting "st" (a single token) -> "s t".

Of all the incredible things that LLMs can do, why do you imagine that something so basic is challenging to them?

In a trillion-token training set, how few examples of spelling do you think there are?

Given all the specialized data that is deliberately added to training sets to boost performance in specific areas, are you assuming that it might not occur to them to add coverage of token spellings if it were needed?!
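
For what it's worth, that kind of coverage is trivial to generate. A rough sketch (the toy vocabulary and prompt format here are invented for illustration, not anyone's actual pipeline):

    # Sketch of synthetic "token spelling" training data. A real pipeline
    # would iterate over the actual tokenizer vocabulary; this toy list and
    # prompt format are just for illustration.
    TOY_VOCAB = ["st", "straw", "berry", "tion", "ing"]

    def spelling_examples(vocab):
        # Map each vocabulary piece to its letter-by-letter spelling.
        for piece in vocab:
            prompt = f'Spell "{piece}" letter by letter:'
            target = " ".join(piece)  # "straw" -> "s t r a w"
            yield prompt, target

    for prompt, target in spelling_examples(TOY_VOCAB):
        print(prompt, target)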

Why are you relying on what you believe to be true, rather than just firing up a bunch of models and trying it for yourself?
