Hacker News

curioussquirrel, today at 4:52 PM

Claude's tokenizers have actually been getting less efficient over the years (I think we're on at least the third iteration since Sonnet 3.5). And if you prompt the LLM in a language other than English, or if your users prompt it or generate content in other languages, costs rise even further. I mean hundreds of percent more for languages with complex scripts like Tamil or Japanese. If you're interested in the research we did comparing the tokenizers of several SOTA models across multiple languages, just hit me up.
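
For a rough sense of where figures like "hundreds of percent" can come from, here is a minimal Python sketch. It does not use Claude's tokenizer or any real one: `byte_level_tokens` is a hypothetical stand-in that counts one token per UTF-8 byte, which is roughly the worst case when a vocabulary has no merges for a script. The sample strings and helper names are assumptions for illustration only.

```python
from typing import Callable


def byte_level_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def tokens_per_char(text: str, count_tokens: Callable[[str], int]) -> float:
    """Average number of tokens spent per Unicode code point."""
    return count_tokens(text) / len(text)


def overhead_percent(baseline: float, other: float) -> float:
    """How much more 'other' costs than 'baseline', as a percentage."""
    return (other - baseline) / baseline * 100.0


# Illustrative greetings, not a parallel corpus.
samples = {
    "English": "Hello",
    "Japanese": "こんにちは",
    "Tamil": "வணக்கம்",
}

baseline = tokens_per_char(samples["English"], byte_level_tokens)
for language, text in samples.items():
    rate = tokens_per_char(text, byte_level_tokens)
    extra = overhead_percent(baseline, rate)
    print(f"{language}: {rate:.1f} tokens/char, {extra:.0f}% over English")
```

Swapping `byte_level_tokens` for a real tokenizer's counting function, and the greetings for the same text translated into each language, would give a fairer comparison; real tokenizers merge common byte sequences, so actual gaps depend on the vocabulary.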


Replies

arcanemachiner, today at 5:25 PM

I would encourage you to post a link here, and also to submit to HN if you haven't already. :)