The main limitation of tokenization is actually logical operations, including arithmetic. IIRC, most of the poor performance of LLMs on math problems can be attributed to the rather strange ways numbers get carved up into tokens.
I'd like to see a math/logic benchmark appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
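As a quick illustration of the carving-up problem (a sketch using tiktoken's cl100k_base encoding; exact splits depend on the tokenizer):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["1000", "9999", "10000", "12345"]:
        pieces = [enc.decode([t]) for t in enc.encode(s)]
        print(s, "->", pieces)
    # cl100k_base pre-tokenizes digit runs left to right in chunks of
    # up to three, so e.g. "12345" comes out as ["123", "45"]: place
    # value is not aligned with token boundaries.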
Even if LLMs get better at arithmetic, they don't seem like the right tool for the job.
LLMs might never be able to crunch numbers reliably, but I expect them to be very good at identifying the right formula and inputs for a problem ("I need the answer to x*y, where x=12938762.3 and y=902832.2332"). Then they can call a math engine (a calculator, Wolfram Alpha, whatever) to do the actual computation. That's what humans do anyway!
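A minimal sketch of that division of labor (the tool name and call format here are made up, not any particular API): the model only has to emit a structured call, and a deterministic engine does the arithmetic.

    from decimal import Decimal, getcontext

    # Hypothetical structured call the LLM would emit (made-up format):
    tool_call = {"tool": "multiply",
                 "args": {"x": "12938762.3", "y": "902832.2332"}}

    def run_tool(call):
        """The deterministic 'math engine' the model delegates to."""
        getcontext().prec = 50  # plenty of precision for exact decimal math
        if call["tool"] == "multiply":
            return Decimal(call["args"]["x"]) * Decimal(call["args"]["y"])
        raise ValueError(f"unknown tool: {call['tool']}")

    print(run_tool(tool_call))  # 11681531662152.96836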
It's a non-deterministic language model; shouldn't we expect mediocre performance in math? It seems like the wrong tool for the job...
Do LLMs need to be good at math using the same approach?
To draw an analogy: the human brain has specialized regions.
Why not give the AI brain a part that's not neural nets, but circuitry specialized for math?
Maybe a dumb question since I'm a layperson!
Regarding "math with tokens": there was a paper on a tokenization scheme with dedicated tokens for integer numbers, where the token's value is the number itself. The model learned to work with numbers as numbers and with tokens for everything else... and it was good at math. I can't find the link; it was on Hugging Face Papers.
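I don't know which paper that was either, but a hedged sketch of the general idea (names and details here are mine): replace every literal number with a single [NUM] token and scale its learned embedding by the parsed value, so magnitude lives in the vector rather than in digit strings.

    import torch
    import torch.nn as nn

    class ValueAwareEmbedding(nn.Module):
        """Embed ordinary tokens normally; scale the [NUM] token's
        embedding by the numeric value it stands for."""

        def __init__(self, vocab_size: int, d_model: int, num_token_id: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.num_token_id = num_token_id

        def forward(self, token_ids: torch.Tensor, values: torch.Tensor):
            # values[i] holds the parsed number wherever token_ids[i]
            # is [NUM]; it is ignored (treated as 1.0) everywhere else.
            e = self.embed(token_ids)
            scale = torch.where(token_ids == self.num_token_id,
                                values, torch.ones_like(values))
            return e * scale.unsqueeze(-1)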
It's not strange at all. I am playing with lambda calculus and combinatory logic right now, as a foundation for mathematics (my interest is in understanding rigorous thinking). You can express any computation using just the S and K combinators, but there is a price: the computations will be rather slow. So to make computation faster, we can add extra combinators and rules that speed things up (a good example is the clapp() function in https://github.com/tromp/AIT/blob/master/uni.c).
Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get wrong results. But if an inconsistent rule is complicated enough to be triggered only infrequently, you will still get the correct result most of the time.
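A minimal sketch of what a consistent extra rule looks like (my own toy encoding, not the one from uni.c): a normal-order SK reducer where the identity combinator I gets a direct shortcut rule, which is sound because I = S K K reduces the same way.

    # Atoms are the strings "S", "K", "I"; an application is a 2-tuple (f, x).

    def rebuild(head, args):
        for a in args:
            head = (head, a)
        return head

    def step(t):
        """One leftmost reduction step; returns None if t is in normal form."""
        if isinstance(t, str):
            return None
        spine, head = [], t
        while isinstance(head, tuple):   # unwind t = head x1 x2 ... xn
            spine.append(head[1])
            head = head[0]
        args = spine[::-1]
        if head == "K" and len(args) >= 2:        # K x y -> x
            return rebuild(args[0], args[2:])
        if head == "S" and len(args) >= 3:        # S f g x -> f x (g x)
            f, g, x = args[:3]
            return rebuild(((f, x), (g, x)), args[3:])
        if head == "I" and len(args) >= 1:        # shortcut rule: I x -> x
            return rebuild(args[0], args[1:])     # consistent, since I = S K K
        for i, a in enumerate(args):              # else reduce inside an argument
            r = step(a)
            if r is not None:
                return rebuild(head, args[:i] + [r] + args[i + 1:])
        return None

    def normalize(t, limit=10_000):
        for _ in range(limit):
            r = step(t)
            if r is None:
                return t
            t = r
        raise RuntimeError("no normal form within limit")

    # The shortcut agrees with the base rules: I = S K K.
    I_skk = (("S", "K"), "K")
    assert normalize((I_skk, "S")) == normalize(("I", "S")) == "S"

An inconsistent shortcut (say, one that sometimes swapped the K rule's arguments) would be caught only on the inputs that happen to trigger it, which is exactly the failure mode described above.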
Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules applied to a somewhat fuzzily known set of combinators (programs), each represented by a token (the term itself being represented by the embedding vector). However, the learned rules are not necessarily consistent (as happens in the source data), so you get the occasional logical error (I don't want to call it hallucination, because it's a different phenomenon from the nondeterminism and extrapolation of LLMs).
This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well, thanks to the logical principle of explosion (from P and not-P, anything follows). It also clearly cannot be completely fixed with more training data.
(There is also an analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... The pre-rigorous stage corresponds to a somewhat random set of likely inconsistent logical rules, the rigorous stage to a small set of obviously consistent rules (say, only S and K), and the post-rigorous stage to a large set of rules that have been vetted for consistency.)
What is the "solution" to this? Well, I think during training you somehow need to ensure that the transformer rules learned by the LLM are logically consistent on the strictly logical fragment of human language that is relevant to logic and programming problems. Which is admittedly not an easy task (I doubt it's even possible within the NN framework).
This paper has a good solution:
https://arxiv.org/abs/2402.14903
You tokenize digits right to left in groups of three, so 1234567 becomes 1 234 567 rather than the default 123 456 7. If you also ensure all 1-3-digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
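Roughly, the change looks like this (a sketch of the idea, not the exact pattern either paper ships): a lookahead makes a 1-3-digit chunk legal only when the remaining digits form whole groups of three, which forces the short group to the left.

    import re

    # Default GPT-style pre-tokenization chunks digit runs left to right:
    LTR = re.compile(r"\d{1,3}")
    # Right-to-left grouping: a 1-3 digit match is allowed only when the
    # rest of the digit run is a multiple of three digits long.
    RTL = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

    print(LTR.findall("1234567"))  # ['123', '456', '7']
    print(RTL.findall("1234567"))  # ['1', '234', '567']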