Hacker News

danielmarkbruce 01/21/2025

How many examples like that do you think it's seen? You can't give an example of something that is in effect a trick to get character-level tokenization and then expect it to do well when it's seen practically zero of such data in its training set.

Nobody who suggests methods like character- or byte-level 'tokenization' suggests a model trained on current tokenization schemes should be able to do what you are suggesting. They are suggesting actually training it on characters or bytes.
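For anyone who wants to see the distinction concretely, here's a rough sketch using tiktoken (the cl100k_base vocabulary is my assumption; the thread doesn't name a tokenizer). The point is just that a BPE-trained model sees a handful of subword ids, not ten characters:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE vocabulary (assumed for illustration)

tokens = enc.encode("strawberry")
print(tokens)                                # a few subword ids, not 10 characters
print([enc.decode([t]) for t in tokens])     # e.g. something like ['str', 'aw', 'berry'];
                                             # the exact split depends on the vocabulary

# Character-level "tokenization" would instead hand the model each letter separately:
print(list("strawberry"))                    # ['s','t','r','a','w','b','e','r','r','y']
```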

You say all this as though I'm suggesting something novel. I'm not. Appealing to authority is kinda lame, but maybe see Andrej's take: https://x.com/karpathy/status/1657949234535211009


Replies

HarHarVeryFunny 01/21/2025

So, one final appeal to logic from me here:

1) You must have tested and realized that these models can spell just fine, i.e. break a word into a sequence of letters, regardless of how you believe they are doing it.

2) As shown above, even when presented with a word already broken into a sequence of letters, the model STILL fails to reliably count the occurrences of a given letter. You can argue about WHY it fails (a different discussion), but regardless it does (if only allowed to output a number).

Now, "how many r's in strawberry", unless memorized, is accomplished by breaking it into a sequence of letters (which it can do fine), then counting the letters in the sequence (which it fails at).

So, you're still sticking to your belief that creating the letter sequence (which it can do fine) is the problem?!

Rhetorical question.

HarHarVeryFunny 01/21/2025

Tasks like reversing a list (Karpathy) or counting categories within it are far harder than simple prediction - the one thing LLMs are built to do.

Try it for yourself. Try it on a local model if you are paranoid that the cloud model is using a tool behind your back.
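One hedged way to run the experiment locally, assuming an Ollama server with its OpenAI-compatible endpoint and some pulled model (the "llama3" name and local URL below are just placeholders for whatever you have installed):

```python
# pip install openai
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost; the api_key is ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",  # placeholder: substitute whichever local model you have pulled
    messages=[{
        "role": "user",
        "content": "How many r's are in s t r a w b e r r y? Answer with a single number.",
    }],
)
print(resp.choices[0].message.content)
```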