
vicchenai, today at 4:59 AM

The leaderboard framing is clever: it forces an apples-to-apples comparison on a task where you can verify correctness deterministically. What I find interesting are the architectural constraints: 10-digit addition requires maintaining ~20 digits of working state across the carry chain, which is fundamentally sequential. The fact that tiny transformers can learn this at all (rather than just memorizing) suggests they are finding some form of positional carry representation in their attention patterns. Would love to see ablations on how attention head count and depth trade off here. My intuition is that carry propagation needs depth more than width.
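
To make the "fundamentally sequential" point concrete, here is a rough sketch of plain schoolbook addition that also records the carry out of each digit position. This is not anything from the leaderboard itself; `add_digits` and `longest_carry_chain` are names I made up for illustration. The only point is that each digit's result depends on the carry from the position before it, so in the worst case information has to travel across all 10 positions.

```python
def add_digits(a: str, b: str) -> tuple[str, list[int]]:
    """Schoolbook addition on decimal strings.

    Returns the sum as a string plus the carry out of each position,
    least significant position first. (Hypothetical helper for illustration.)
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    carries_out = []
    # Walk from the least significant digit to the most significant one.
    for i in range(width - 1, -1, -1):
        total = int(a[i]) + int(b[i]) + carry  # depends on the previous carry
        digits.append(str(total % 10))
        carry = total // 10
        carries_out.append(carry)
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)), carries_out


def longest_carry_chain(a: str, b: str) -> int:
    """Length of the longest run of consecutive positions that emit a carry."""
    _, carries_out = add_digits(a, b)
    best = run = 0
    for c in carries_out:
        run = run + 1 if c else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    print(add_digits("9999999999", "0000000001")[0])
    print(longest_carry_chain("9999999999", "0000000001"))
    print(longest_carry_chain("1234567890", "1111111111"))
```

The first input is the worst case: a carry generated at the units digit ripples through every position. Most random pairs have much shorter chains, which is one reason a model could score well on average while still failing on long ripples.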