
3abiton • 06/16/2025 • 0 replies

> Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely.

I skimmed the paper, and unlike transformers, their model can apparently scale much more efficiently to longer contexts. While it's possible to fit 1M tokens, you need a significant amount of memory. They only benchmark against GPT-2, though, so I'd say this is quite preliminary work so far, although the architecture looks promising.
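
For a rough sense of what "quadratic memory" means here, a back-of-the-envelope sketch (my own arithmetic, not from the paper): it assumes naive attention that materializes the full n x n score matrix in fp16 for a single head in a single layer. The function name is made up for illustration, and memory-efficient attention kernels avoid storing this matrix, although compute stays quadratic.

```python
def attention_scores_bytes(n_tokens: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one n x n attention score matrix.

    Assumption: naive attention, one head, one layer, fp16 (2 bytes per element).
    """
    return n_tokens * n_tokens * bytes_per_element


if __name__ == "__main__":
    for n in (1_000, 32_000, 1_000_000):
        gib = attention_scores_bytes(n) / 2**30
        print(f"{n:>9,} tokens -> {gib:,.2f} GiB per head, per layer")
```

Doubling the context quadruples that number, which is why an architecture that skips token-to-token attention is interesting for long contexts.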
