Hacker News

thomasahle · today at 3:33 PM

There's a graveyard of hundreds of papers with "approximate near-linear-time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
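
For reference, the quadratic cost comes directly from the n-by-n score matrix that exact attention computes. A minimal numpy sketch of the standard computation (illustrative only, names are mine):

    import numpy as np

    def exact_attention(Q, K, V):
        # Q, K, V: (n, d) arrays; the score matrix S is (n, n), hence O(n^2) time and memory
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
        W /= W.sum(axis=-1, keepdims=True)
        return W @ V

Any sub-quadratic method has to avoid materializing (or even iterating over) S, which is where the tension with exactness comes from.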


Replies

jcarreiro · today at 4:47 PM

The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

i.e., the claim is that this method reproduces the results of conventional attention up to Float16 numerical precision.
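
For intuition, here is a toy check of that claim's main ingredient, the truncated Taylor series of exp. This is not the paper's algorithm (the approximation is still computed quadratically, and the scaling, test data, and function names are my own assumptions); it only shows how the elementwise error of a P-term truncation compares to Float16 resolution for a given logit range:

    import numpy as np
    from math import factorial

    def softmax_attention(Q, K, V):
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))
        return (W / W.sum(axis=-1, keepdims=True)) @ V

    def taylor_attention(Q, K, V, P=4):
        # replace exp(s) with sum_{p=0..P} s^p / p!  (still O(n^2) here; the point
        # is only to measure the truncation error, not to be fast)
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = sum(S**p / factorial(p) for p in range(P + 1))
        return (W / W.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (0.5 * rng.standard_normal((n, d)) for _ in range(3))
    err = np.abs(taylor_attention(Q, K, V) - softmax_attention(Q, K, V)).max()
    print(err, np.finfo(np.float16).eps)  # truncation error vs. Float16 resolution (~1e-3)

How close err gets to Float16 resolution depends heavily on the logit magnitudes, which is presumably where the paper's actual construction does the real work.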

kristjansson · today at 4:15 PM

> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it's successful at that.
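
For a sense of where "constant cost per token" can come from: once exp(q·k) is replaced by a low-order polynomial, the polynomial factors through a fixed-size feature map, so causal attention can be maintained as a running state that doesn't grow with context length. A generic sketch of that kernel trick for a 2nd-order truncation (not necessarily this paper's construction; names are illustrative):

    import numpy as np

    def phi(x):
        # feature map with phi(q) @ phi(k) == 1 + q@k + (q@k)**2 / 2
        return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)])

    def streaming_poly_attention(Q, K, V):
        # causal attention with exp replaced by a 2nd-order polynomial; the state
        # (S, z) has fixed size, so per-token cost is constant in sequence length
        # (the usual 1/sqrt(d) scaling is omitted / assumed folded into Q)
        D = 1 + Q.shape[1] + Q.shape[1] ** 2
        S = np.zeros((D, V.shape[1]))   # running sum of outer(phi(k_j), v_j)
        z = np.zeros(D)                 # running sum of phi(k_j)
        out = []
        for q, k, v in zip(Q, K, V):
            fk = phi(k)
            S += np.outer(fk, v)
            z += fk
            fq = phi(q)
            out.append(fq @ S / (fq @ z))
        return np.array(out)

The catch is that the state grows roughly like d^P with the polynomial order, so "constant per token" hides a constant that grows quickly with P.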

fheinsen · today at 3:59 PM

Once the error of the linear approximation approaches the same magnitude as the numerical error of the quadratic computation, don't the two start becoming comparable in practice?

I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.

Numerical error, in fact, may be a key factor in why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:

https://www.anthropic.com/engineering/effective-context-engi...
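
One way to make that comparison concrete (a toy measurement, not from the paper): run exact attention in float64 and again with inputs cast to float16; the elementwise gap is the roundoff "budget" an approximation's error would be measured against.

    import numpy as np

    def attn(Q, K, V):
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))
        return (W / W.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    n, d = 1024, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

    exact = attn(Q, K, V)                                    # float64 reference
    half = attn(*(x.astype(np.float16) for x in (Q, K, V)))  # same computation at half precision
    print(np.abs(exact - half.astype(np.float64)).max())     # roundoff gap to compare against

(Real inference kernels typically accumulate in higher precision than their storage format, so a pure-float16 computation like this is a pessimistic baseline.)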

WhitneyLand · today at 5:53 PM

Even if the 2023 paper is correct, it doesn't preclude the 2026 paper from being true; it just sets constraints on how a faster attention solution would have to work.

cobolexpert · today at 4:05 PM

Dumb question: is the quadratic time complexity for training, inference, or both?

naasking · today at 4:19 PM

I think any kind of innovation here will have to take advantage of some structure inherent to the problem, like eliminating attention in favour of geometric structures like Grassmann flows [1].

[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428

cubefox · today at 3:43 PM

I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.
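
For context on the "sub n^2" part: V3.2's sparse attention, as publicly described, has each query attend to only a small selected set of keys, so the attention compute scales like O(n*k) rather than O(n^2). A generic top-k sparse-attention sketch (not DeepSeek's implementation, and it ignores the cost of choosing the indices; the random causal indexer below is just a stand-in):

    import numpy as np

    def topk_sparse_attention(Q, K, V, idx):
        # idx[i] holds the key positions query i attends to; with at most k indices
        # per query, the attention compute is O(n*k) instead of O(n^2)
        out = np.empty((len(Q), V.shape[1]))
        scale = 1.0 / np.sqrt(Q.shape[-1])
        for i, q in enumerate(Q):
            Ki, Vi = K[idx[i]], V[idx[i]]
            s = Ki @ q * scale
            w = np.exp(s - s.max())
            out[i] = (w / w.sum()) @ Vi
        return out

    n, d, k = 256, 64, 32
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    idx = [rng.choice(i + 1, size=min(k, i + 1), replace=False) for i in range(n)]  # stand-in for a learned indexer
    out = topk_sparse_attention(Q, K, V, idx)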
