> self-attention is efficiently computable to arbitrary precision with constant cost per token
This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it actually succeeds at that.
It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.
It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, the cost must be O(n^2): token t has to read all t previous tokens, and summing that over the sequence gives roughly n^2/2 operations. Any sub-quadratic attention must therefore discard some information and will be unable to support perfect recall on long enough sequences.
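To make the counting argument concrete, here's a rough sketch (my own illustration, not the paper's method; the dimensions, feature map, and function names are all made up): exact causal attention reads the full history at every step, so per-token cost grows with position, while a constant-cost-per-token scheme like linear attention folds the history into a fixed-size state and necessarily gives a lossy summary.

```python
import numpy as np

d = 8          # head dimension (illustrative)
n = 1024       # sequence length (illustrative)
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Exact causal attention: token t must read all t+1 keys/values so far.
# Per-token work grows with t, so total work is sum_{t=1..n} t = O(n^2).
def exact_attention(q, k, v):
    out = np.zeros_like(v)
    for t in range(len(q)):
        scores = q[t] @ k[: t + 1].T               # O(t) dot products
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ v[: t + 1]
    return out

# Constant-cost-per-token variant (linear-attention style): the history is
# folded into a fixed-size state S (d x d) plus a normalizer z (d). Per-token
# cost is O(d^2), independent of t -- but a fixed-size S cannot losslessly
# encode arbitrarily many distinct (k, v) pairs, which is exactly the
# information-loss trade-off described above.
def linear_attention(q, k, v):
    out = np.zeros_like(v)
    S = np.zeros((d, d))
    z = np.zeros(d)
    phi = lambda x: np.maximum(x, 0) + 1e-6        # toy positive feature map
    for t in range(len(q)):
        S += np.outer(phi(k[t]), v[t])             # O(d^2), constant per token
        z += phi(k[t])
        out[t] = (phi(q[t]) @ S) / (phi(q[t]) @ z)
    return out

# The two disagree, and the gap is the price of the fixed-size summary.
err = np.abs(exact_attention(q, k, v) - linear_attention(q, k, v)).max()
print(f"max deviation from exact attention: {err:.3f}")
```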