I wonder if the reason the models have trouble with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in "strawberry". They're fluent in English audio tokens, but not in written ones.
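If you want to see it concretely, here's a rough sketch with the tiktoken library (assuming the cl100k_base encoding used by the GPT-4-era models; other tokenizers split differently):

    import tiktoken

    # Peek at what a BPE tokenizer actually hands the model for "strawberry".
    enc = tiktoken.get_encoding("cl100k_base")

    word = "strawberry"
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]

    print(ids)     # a short list of integer token IDs, not 10 characters
    print(pieces)  # a few multi-character chunks, not individual letters

The model only ever sees those IDs, so "count the R's" asks it about structure it was never shown directly.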
Agree. We've given them a different alphabet than ours.
They speak a different language that captures the same meaning, but has different units.
Somehow they need to learn that their unit of thought is not the same as our unit of speech, so that questions like these get mapped onto a different alphabet.
That's my two cents.
The amazing thing continues to be that they can ever answer these questions correctly.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
Yeah, that's my understanding of the root cause too. It can also cause weirdness with numbers, because they aren't tokenized one digit at a time. That's done for good reason, but it still causes some unexpected issues.
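Same kind of sketch for digits (again with tiktoken and cl100k_base, just to illustrate): long numbers come out as multi-digit chunks rather than one token per digit.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Numbers are split into chunks of digits, not digit-by-digit,
    # so the model's view of "1234567" isn't seven separate symbols.
    for num in ["7", "1234567", "3.11"]:
        ids = enc.encode(num)
        print(num, "->", [enc.decode([i]) for i in ids])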
The way LLMs get it right by counting the letters and then changing their answer at the last second makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a Reddit thread) that repeats over and over that there's the wrong number of R's. We've seen many weird glitches like this before (e.g. a specific Reddit username that would crash ChatGPT).