Next-token prediction cannot do calculations. That is fundamental.
It can produce outputs that resemble calculations.
It can prompt an agent to feed some numbers into a separate program that does the calculation and returns the result as a prompt.
Neither of these is a calculation.
Humans can't do calculations either, by your definition. Only computers can.
So you don't think a 50T-parameter neural network can encode the logic for adding two n-bit integers, for reasonably sized n? That would be pretty sad.
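To make the "encode the logic" point concrete, here is a rough sketch of n-bit addition built only from threshold units (weighted sum plus bias, then a step function) with hand-picked weights. The names `step`, `full_adder`, and `add_n_bit` are made up for illustration, and the weights are set by hand, not learned, so this only shows that fixed weights can represent the logic. It says nothing about whether training a next-token predictor actually finds them.

```python
def step(x: float) -> int:
    """Threshold activation: fires (1) when the weighted input is positive."""
    return 1 if x > 0 else 0


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder made of two threshold units with hand-set weights."""
    # Carry unit: fires when at least two of the three inputs are 1.
    carry_out = step(a + b + carry_in - 1.5)
    # Sum unit: fires when the input total is odd; the carry unit's output
    # is fed back in with weight -2 to cancel the totals 2 and 3.
    sum_bit = step(a + b + carry_in - 2 * carry_out - 0.5)
    return sum_bit, carry_out


def add_n_bit(x: int, y: int, n_bits: int) -> int:
    """Ripple-carry addition of two n-bit non-negative integers."""
    carry = 0
    result = 0
    for i in range(n_bits):
        a = (x >> i) & 1  # i-th bit of x
        b = (y >> i) & 1  # i-th bit of y
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << i
    # The final carry becomes bit n of the result.
    return result | (carry << n_bits)
```

Each bit position reuses the same two units, so the count grows linearly with n: roughly 2n threshold units for an n-bit adder when the carry is rippled through in sequence.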