Very cool, but can I suggest the `add` CPU instruction instead? Supports 64-bit numbers, and it's encoded in hardware, and no need to cross a PCIe interface into a beefy, power-hungry GPU and back again. And chances are it's cross-platform, because basically every ISA since the very first has had `add`.
I mean, yeah, no need to put a bunch of high powered cars in a circular track to watch them race really close to each other at incredible speeds, causing various hazards, either. Especially since city buses have been around for ages.
"smallest supercomputing cluster that can add two numbers"
No. You cannot. It's the wrong tool for the problem.
That little "add" of yours has the overhead of: having an LLM emit it as a tool call, having to pause the LLM inference while waiting for it to resolve, then having to encode the result as a token to feed it back.
At the same time, a "transformer-native" addition circuit? Can be executed within a single forward pass at a trivial cost, generate transformer-native representations, operate both in prefill and in autoregressive generation, and more. It's cheaper.