Hacker News

HarHarVeryFunny 01/21/2025

No - you can give the LLM a list of letters and it STILL won't be able to count them reliably, so you are guessing wrong about where the difficulty lies.

Try asking Claude: how many 'r's are in this list (just give me a number as your response, nothing else) : s t r a w b e r r y
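For reference, the task being posed to the model is trivial to do programmatically once the letters are spelled out; a minimal sketch:

```python
# The prompt already presents the word as space-separated letters,
# so the ground-truth answer is a simple count.
letters = "s t r a w b e r r y".split()
print(letters.count("r"))  # 3
```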


Replies

danielmarkbruce 01/21/2025

How many examples like that do you think it's seen? You can't give an example of something that is in effect a trick to get character-level tokenization and then expect the model to do well when it's seen practically zero such data in its training set.

Nobody who suggests methods like character- or byte-level 'tokenization' suggests that a model trained on current tokenization schemes should be able to do what you are suggesting. They are suggesting actually training it on characters or bytes.

You say all this as though I'm suggesting something novel. I'm not. Appealing to authority is kinda lame, but maybe see Andrej's take: https://x.com/karpathy/status/1657949234535211009
