> we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning
I'd offer a simpler explanation: Tokenization.
If you tokenize "12345 * 27271" with a typical BPE tokenizer, you get something like:
"123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic. You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".
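For anyone who wants to verify, this is easy to reproduce with OpenAI's tiktoken library (a minimal sketch, assuming the cl100k_base encoding used by GPT-4-era models; the exact split varies by tokenizer):

```python
# Inspect how a BPE tokenizer splits an arithmetic expression.
# Assumes the `tiktoken` package and the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("12345 * 27271")
print([enc.decode([t]) for t in token_ids])
# e.g. ['123', '45', ' *', ' ', '272', '71'] -- the digits break on
# BPE merge boundaries, not on anything arithmetically meaningful.
```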
I respectfully disagree.
While tokenization certainly plays a role in how language models process input, it's overly simplistic to attribute their struggles with mathematical reasoning to tokenization alone.
SOTA language models don't just rely on individual token predictions; they build up contextual representations across multiple layers. This allows them to capture higher-level meaning beyond simple token-to-token relationships. If this weren't the case, it would be inconceivable that models worked at all outside the most simplistic scenarios.
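A quick way to see those contextual representations directly (a sketch, assuming the `transformers` and `torch` packages and the `bert-base-uncased` checkpoint; the exact similarity value is illustrative):

```python
# The same surface token gets a different vector depending on context.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vec_for(sentence: str, word: str) -> torch.Tensor:
    ids = tok(sentence, return_tensors="pt")
    pos = ids.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    with torch.no_grad():
        return model(**ids).last_hidden_state[0, pos]

river = vec_for("She sat on the river bank.", "bank")
money = vec_for("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(river, money, dim=0))  # noticeably below 1.0
```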
The decline in performance as complexity increases might be due to other factors, such as:
- Limitations in working memory or attention span
- Difficulty maintaining coherence over longer sequences
- Challenges in managing multiple interdependent logical constraints simultaneously (perhaps simply because the QKV matrices are too small; see the attention sketch below)
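To make that last point concrete, here's minimal scaled dot-product attention in NumPy (a sketch; `d_head` is my illustrative name for the per-head projection width whose size bounds what a head can represent):

```python
# Minimal scaled dot-product attention: everything a head can express
# is squeezed through the d_head-dimensional Q/K/V projections.
import numpy as np

def attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # softmax over keys
    return weights @ V                                 # (seq_len, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))                           # 8 tokens, d_model = 64
W = [rng.normal(size=(64, 16)) for _ in range(3)]      # d_head = 16 << d_model
print(attention(x, *W).shape)                          # (8, 16)
```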
And in any case, I think OpenAI's o1 models are crushing it in math right now. The iterative, model-guided chain-of-thought (CoT) approach seems able to handle very complex problems.
Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?
The LLM would know "123" and "45" form one contiguous number, just as a human can tell that "123", a slight pause, then "45" still refers to a single number.
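Some tokenizers already do something like this (LLaMA's, for instance, splits numbers into individual digits). A toy sketch of such a digit-aware pre-tokenization pass (my own illustrative code, not any particular tokenizer's implementation):

```python
import re

def digit_aware_pretokenize(text: str) -> list[str]:
    """Toy pre-tokenizer: every digit becomes its own piece, so "12345"
    can never be merged into arbitrary chunks like "123" + "45".
    Non-digit runs would go to a downstream BPE pass (not shown)."""
    return re.findall(r"\d|\D+", text)

print(digit_aware_pretokenize("12345 * 27271"))
# ['1', '2', '3', '4', '5', ' * ', '2', '7', '2', '7', '1']
```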
Nanda et al. successfully recovered the exact mechanism through which a transformer learned to carry out modular addition. [0] Transformers are all about the training data, and we will increasingly learn that structuring the order in which data is learned matters a lot. But it's clear that transformers are absolutely capable of encoding generalized solutions to arithmetic.
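For reference, the task in that paper is small enough to enumerate exhaustively, which is part of what made the mechanistic analysis tractable. A sketch of the setup (assuming the paper's p = 113 and a roughly 30% train split; see the paper for the exact regimen):

```python
# Full dataset for the modular addition task studied in Nanda et al. [0]:
# inputs (a, b), label (a + b) mod p. A small transformer trained on a
# fraction of all p*p pairs eventually generalizes ("groks") to the rest.
import itertools
import random

p = 113  # prime modulus used in the paper
pairs = [(a, b, (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]
random.seed(0)
random.shuffle(pairs)
cut = int(0.3 * len(pairs))          # train on roughly 30% of all pairs
train, test = pairs[:cut], pairs[cut:]
print(len(train), len(test))         # 3830 8939
```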
Given the right tokenization scheme and training regimen, we can absolutely create LLMs with statistically sound arithmetic capabilities. I still wouldn't trust a stochastic model over the algorithmic certainty of a calculator, but what matters more for mathematicians is that these models can reason about complex problems and help them break new ground on hard mathematical problems by leveraging the full statistical power of their weights.
[0] https://arxiv.org/abs/2301.05217