Hacker News

HarHarVeryFunny · 01/21/2025 · 1 reply

No ... try claude.ai or meta.ai (both behave the same): ask them how many r's are in the (made-up) word ferrybridge. They'll both get it wrong and say 2.

Now ask them to spell ferrybridge. They both get it right.

gemini.google.com still fails on "strawberry" (the other two seem to have trained on that, which is why I used a made-up word instead), but can correctly break it into a letter sequence if asked.
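
For reference, both parts of the task are trivial at the character level - a minimal Python sketch of the two questions (count the r's, then spell the word):

    word = "ferrybridge"
    # Count the r's: the correct answer is 3, not the 2 the models report.
    print(word.count("r"))      # 3
    # Spell it out as a letter sequence - the step the models do get right.
    print(" ".join(word))       # f e r r y b r i d g e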


Replies

danielmarkbruce · 01/21/2025

Yep, if by chance you hit a model that has seen the training data that happens to shove those tokens together in a way that it can guess, lucky you.

The point is, it would be trivial for an LLM to get this right all the time with character-level tokenization. The reason LLMs using the current best-tradeoff tokenization find this task difficult is that the tokens that make up "tree" don't include the token for "e".
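
To make that concrete, here is a rough sketch using OpenAI's tiktoken library (an assumption for illustration - the models mentioned above use their own vocabularies, and the exact token boundaries will differ), contrasting subword tokens with character-level tokens:

    import tiktoken  # pip install tiktoken

    word = "ferrybridge"

    # Subword (BPE-style) tokenization: the word becomes a few opaque token IDs.
    # None of those IDs exposes the letter "r" in a way the model can inspect.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(word)
    print(ids)                              # a short list of integer IDs
    print([enc.decode([i]) for i in ids])   # pieces like "ferry" / "bridge" (exact split depends on the vocab)

    # Character-level tokenization: one token per letter, so counting r's is trivial.
    char_tokens = list(word)
    print(char_tokens.count("r"))           # 3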
