Hacker News

comeonbro 02/22/2025

It's not a technical challenge in this case, it's a technical tradeoff. You could train an LLM with single characters as the atomic unit and it would be able to count the 'r's in 'strawberry' no problem. The tradeoff is that processing the word 'strawberry' would then take 10 sequential steps, 10 complete runs through the entire LLM, where each one has to finish before the next can start.

Instead, they're almost always trained with multi-character tokens as the atomic unit (that is what we see them as; the model literally does not), so 'strawberry' is spelled 'Ⰹ⧏⏃'. Processing that is only 3 sequential steps, only 3 complete runs through the entire LLM. But to count the 'r's in 'Ⰹ⧏⏃' correctly, it needs to encounter enough relevant text in training to figure out that 'Ⰹ' somehow has 1 'r' in it, '⧏' has 0 'r's, and '⏃' has 2 'r's, and not a lot of text demonstrates that.
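To make the two views concrete, here is a minimal Python sketch. It is a toy, not a real tokenizer: the split into 'str' / 'aw' / 'berry' and the integer IDs are assumptions picked only so the per-token 'r' counts come out as 1, 0 and 2, matching the example above. Real tokenizers may split the word differently.

```python
# Toy illustration only: the split and the IDs below are made up,
# not the output of any real tokenizer.
word = "strawberry"

# Character-level view: one atomic unit (one sequential step) per character.
char_units = list(word)

# Token-level view: a hypothetical 3-piece split, each piece mapped to an opaque ID.
toy_pieces = ["str", "aw", "berry"]
toy_vocab = {"str": 101, "aw": 102, "berry": 103}  # hypothetical IDs
token_units = [toy_vocab[piece] for piece in toy_pieces]

# With characters as units, counting 'r' is a direct comparison per unit.
r_count_from_chars = sum(1 for unit in char_units if unit == "r")

# With token IDs as units, nothing in 101/102/103 says how many 'r's they hold.
# This table is exactly the knowledge the model would have to pick up from training text.
r_per_token_id = {toy_vocab[piece]: piece.count("r") for piece in toy_pieces}
r_count_from_tokens = sum(r_per_token_id[token_id] for token_id in token_units)

print("character steps:", len(char_units))
print("token steps:", len(token_units))
print("r count via characters:", r_count_from_chars)
print("r count via learned table:", r_count_from_tokens)
```

In this toy the step count drops from 10 to 3, roughly a 3.3x saving, and the price is that the letter content of each ID is no longer visible in the input.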

The tradeoff is between everything being 3-5x slower and more expensive (but you can count the 'r's in 'strawberry') and, as basically the only downside, being bad at character-level tasks like counting letters in words.

Easy choice, but it leads to this stupid misunderstanding being absolutely everywhere and, just by itself, doing an enormous amount of damage to people's ability to understand what is happening and about to happen.


Replies

hansmayer 02/23/2025

Right so... they are still not able to spell the single letters because the algorithm we use to train them to do so is far too expensive? Wake me up when it "happens" (and it gets out of its current, three-year-long 'about to happen' phase), e.g. when it stops costing 200B USD to do character-level tokenisation in a string, a problem we first solved some 50-60 years ago with higher-order programming languages. Funnily enough, those algorithms can run on an 8-bit computer in negligible time and require nowhere near the resources these Frankensteins need in order to sometimes get the count of Rs in strawberries right. Provided we train them with petabytes of data, and provide gigawatts of power.
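For reference, the classic character-level approach mentioned here is a single pass over the string. A minimal Python sketch follows; the function name `count_letter` is just an illustrative choice, and the case-insensitive comparison is an assumption about what "counting Rs" should mean.

```python
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a single letter in text, ignoring case."""
    target = letter.lower()
    count = 0
    # One pass over the characters, constant extra memory.
    for ch in text:
        if ch.lower() == target:
            count += 1
    return count


print(count_letter("strawberry", "r"))
```

This works because the program sees individual characters as its atomic unit, which is the view a token-based model does not have.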
