Hacker News

Smallest transformer that can add two 10-digit numbers

128 points by ks2048 last Thursday at 6:29 PM | 51 comments

Comments

alexlitz today at 2:44 AM

I made a blogpost on my submission (currently the top handwritten one at 36 parameters) https://alexlitzenberger.com/blog/building_a_minimal_transfo...

reerdna today at 6:22 AM

I couldn't help but laugh out loud at the notion of a "held-out test set" for addition of 10-digit numbers.

prng2021 today at 6:02 AM

How is anyone predicting timelines for AGI when these systems can’t do basic addition of 2 arbitrary numbers with 100% accuracy?

amelius today at 12:33 AM

> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.

I wonder why they don't just write the code themselves, so by design the focus can be on the model.

delta_p_delta_x today at 3:10 AM

Very cool, but can I suggest the `add` CPU instruction instead? Supports 64-bit numbers, and it's encoded in hardware, and no need to cross a PCIe interface into a beefy, power-hungry GPU and back again. And chances are it's cross-platform, because basically every ISA since the very first has had `add`.

vicchenai today at 4:59 AM

The leaderboard framing is clever - forces apples-to-apples comparison on a task where you can verify correctness deterministically. What I find interesting is the architectural constraints: 10-digit addition requires maintaining ~20 digits of working state across the carry chain, which is fundamentally sequential. The fact that tiny transformers can learn this at all (rather than just memorizing) suggests they are finding some form of positional carry representation in their attention patterns. Would love to see ablations on how attention head count vs depth trade off here - my intuition is that carry propagation needs depth more than width.
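
To make the carry-chain point concrete, here is a minimal sketch of schoolbook digit-wise addition. It is not taken from any leaderboard submission, and the function name `add_digits` is made up for illustration. Each output digit depends on the carry coming from the position to its right, which is the sequential dependency the comment describes.

```python
def add_digits(a: str, b: str) -> str:
    """Add two non-negative decimal strings one digit at a time.

    Hypothetical helper for illustration only; not part of the challenge code.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # left-pad the shorter operand with zeros
    carry = 0
    out = []
    # Walk from the least significant digit to the most significant one.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))  # digit emitted at this position
        carry = total // 10          # carry handed to the next position
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))
```

The worst case is something like `add_digits("9999999999", "1")`, where a single carry has to ripple through every position before the leading digit is known.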

E-Reverance today at 1:27 AM

Not sure how well this fits the rules, but I saw someone on Twitter claim 28 params: https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...

cantalopes today at 6:02 AM

Interesting. Is this just a fun competition, or would it also have some practical applications, I wonder?

medi8r today at 12:59 AM

You can do that in a single matmul of course.
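
One way to read this claim, sketched under assumptions: if both operands are encoded as digit vectors and the output is allowed to be a single scalar, the sum is a 1x20 by 20x1 product against a fixed vector of place values. The name `add_via_matmul` and the encoding are illustrative guesses, not something the commenter specified. Note that this sidesteps the carry problem only because it emits a number rather than a sequence of digit tokens.

```python
def add_via_matmul(a: int, b: int, width: int = 10) -> int:
    """Sum two non-negative integers as one dot product over their digits.

    Hypothetical sketch: the encoding and function name are assumptions.
    """
    if not (0 <= a < 10 ** width and 0 <= b < 10 ** width):
        raise ValueError(f"operands must be non-negative and fit in {width} digits")
    # Input row vector: the digits of a followed by the digits of b.
    digits = [int(c) for c in str(a).zfill(width)] + [int(c) for c in str(b).zfill(width)]
    # Fixed "weight" column: place values 10^(width-1) .. 10^0, repeated for each operand.
    place_values = [10 ** (width - 1 - i) for i in range(width)] * 2
    # The single matrix product, written out as a dot product.
    return sum(d * w for d, w in zip(digits, place_values))
```

Python integers are arbitrary precision, so the result is exact here; with floating-point weights the same product could lose precision on large operands.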

i000 today at 2:11 AM

Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?

ks2048 today at 1:30 AM

So hand-coded weights can do it with 36 params, while trained weights need 311. Did anyone try the former architecture, but starting from random weights and learning?

nextlevelwizard today at 5:51 AM

Here: eval()

You are welcome

munro today at 2:10 AM

>=99% accuracy wtf?!?

I was initially excited, because this could reveal some sort of required local min capacity, until I saw that. The further revelation that this was all vibe coded, with no arXiv paper, makes me feel I should save my attention for another article.

1over137 today at 2:04 AM

Now wrap it all in an Electron app!

computersuck today at 3:48 AM

this is the dumbest fking thing to do math with

Sophira today at 2:51 AM

I get that this is technically interesting, for certain, but the sheer amount of energy and associated global warming risk needed to do something with >=99% accuracy that we've been able to do easily for decades with a guaranteed 100% accuracy seems to me to be wasteful to the extreme.
