Hacker News

og_kalu · 01/22/2025 · 1 reply

Have you seen the Byte-latent Transformer paper?

It does away with sub-word tokenization but is still more or less a transformer (no working memory or internal iteration). Mostly, the (performance) gains seem modest (not unanimous; on some benchmarks it's a bit worse)... until you hit anything to do with character-level manipulation, where it just stomps. 1.1% to 99% on CUTE - Spelling is a particularly egregious example.

I'm not sure what the problem is exactly, but clearly something about sub-word tokenization is giving these models a particularly hard time on these sorts of tasks.

https://arxiv.org/abs/2412.09871
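
To make the tokenization point concrete, here's a tiny Python sketch (mine, not the paper's; it uses the tiktoken library as a stand-in BPE tokenizer, whereas BLT works on raw bytes) showing that the model never sees the characters of a word, only a handful of sub-word token IDs:

    # Minimal sketch (my own illustration, not from the paper) of what
    # sub-word tokenization hides from the model. tiktoken is used here
    # just as a convenient BPE tokenizer; BLT itself operates on raw bytes.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-style BPE vocabulary

    word = "strawberry"
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

    # The transformer's input is this short list of token IDs, not the
    # 10 individual characters.
    print(token_ids)  # a few integers, e.g. something like [496, 675, 15717]
    print(pieces)     # e.g. ['str', 'aw', 'berry']

    # Spelling, counting letters, reversing, etc. all require the model to
    # have effectively memorized how each sub-word piece is spelled, since
    # it never sees the characters directly.
    print(len(word), "characters vs", len(token_ids), "tokens")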


Replies

HarHarVeryFunny · 01/22/2025

The CUTE benchmark is interesting, but it doesn't include enough examples of the actual prompts used and model outputs to be able to evaluate the results. Obviously transformers internally manipulate their input at token-level granularity, so to succeed at character-level manipulation they first need to generate the character-level token sequence, THEN do the manipulation. Prompting them to directly output a result without letting them first generate the character sequence would therefore guarantee bad performance, so it'd be important to see the details.

https://arxiv.org/pdf/2409.15452
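
For what it's worth, here's a rough Python sketch of the distinction being drawn; the prompt wording is hypothetical and not taken from the CUTE paper, it just contrasts an "answer directly" prompt with a "spell it out first, then manipulate" prompt:

    # Hypothetical prompt templates (wording is mine, not from the CUTE
    # paper) contrasting the two setups described above.

    WORD = "strawberry"

    # Direct prompt: the model has to emit the reversed word immediately,
    # while still operating over sub-word tokens internally.
    direct_prompt = f'Reverse the word "{WORD}". Answer with the reversed word only.'

    # Staged prompt: the model first writes out the character sequence
    # (re-tokenizing the word at character granularity) and only then
    # performs the manipulation.
    staged_prompt = (
        f'First list the characters of "{WORD}" one per line, '
        'then write the word reversed.'
    )

    # A real comparison would send both prompts to the same model and score
    # the answers; the point above is that the staged variant should do
    # much better, so it matters which setup a benchmark actually used.
    print(direct_prompt)
    print(staged_prompt)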