Hacker News

8note, today at 6:27 PM

> When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries".

No, it doesn't. It makes sense that they can't count the r's because they don't have access to the actual word, only to tokens that might represent parts or the whole of the word.
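To make the token argument concrete, here is a rough sketch of what "only tokens" means. The vocabulary `VOCAB` and the greedy longest-match `tokenize` helper are made up for illustration; real tokenizers (BPE and friends) use learned vocabularies and may split the word differently.

```python
# Toy illustration only: VOCAB is a hypothetical vocabulary,
# not the vocabulary of any real model.
VOCAB = {"straw": 0, "berries": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
         "a": 6, "w": 7, "b": 8, "e": 9, "i": 10}


def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of `word` into token IDs."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate piece first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos]!r}")
    return ids


ids = tokenize("strawberries")
print(ids)                        # the model's input: opaque integers
print("strawberries".count("r"))  # counting needs the characters themselves
```

With this toy vocabulary the word becomes two IDs, and nothing in those integers says how many r's each piece contains. That association would have to be learned separately.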


Replies

orbital-decay, today at 6:59 PM

Tokenization is a simplistic explanation which is likely wrong, at least in part. LLMs are perfectly fine at reciting words character by character, at handling different tokenizations of the same word if forced to (e.g. replacing the leading space or breaking words up into basic character tokens), at complex word formation in languages that depend heavily on it, etc. LLMs work with concepts rather than tokens.
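A small sketch of the "different tokenization strategies for the same word" point. The three segmentations in `SEGMENTATIONS` are hypothetical, chosen by hand rather than produced by a real tokenizer, and the helpers `decode` and `count_letter` are just for illustration. It only shows that the same text can arrive as different token sequences; it says nothing about what a model actually learns from them.

```python
# Hypothetical segmentations of one word; none comes from a real tokenizer.
SEGMENTATIONS = {
    "subword": ["straw", "berries"],
    "leading_space": [" straw", "berries"],
    "characters": list("strawberries"),
}


def decode(pieces):
    """Join token strings back into text, dropping any leading space."""
    return "".join(pieces).lstrip(" ")


def count_letter(pieces, letter):
    """Count `letter` across the token strings."""
    return sum(piece.count(letter) for piece in pieces)


for name, pieces in SEGMENTATIONS.items():
    # Columns: strategy name, number of tokens, decoded text, count of "r".
    print(name, len(pieces), decode(pieces), count_letter(pieces, "r"))
```

All three sequences decode to the same word and carry the same letter count. Only the character-level split puts each letter in its own token.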