I love how this paper describes what actually happens and what the current tradeoffs are.
That said, many LLMs run on SIMD GPUs, in warps; at their core they are just doing a lot of vector multiplications, activation functions, and KV self-attention (the expensive step).
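To make the "expensive step" concrete, here's a minimal single-head causal self-attention sketch in NumPy (my own illustrative code, not from the paper): the (n, n) score matrix is the part that grows quadratically with sequence length.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over n tokens of dimension d."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Score matrix has shape (n, n) -- this is the quadratic step.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: each token attends only to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)
    scores = np.where(mask, -1e9, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (8, 4)
```

Everything else in the forward pass (the matmuls and activations) is linear in n per token, which is why attention dominates for long contexts.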
The issue is that we want LLMs to be one-way passes through the layers, whereas Turing-complete programming languages support loops with no well-defined stopping time. You can embed a simple computer into an LLM, but it won’t be able to run long loops.
However, for these specific workloads, needing to attend only to the latest state is indeed a huge optimization! Gone is the O(n^2) attention that dominates the cost; now it is O((log n)^2), which is far smaller.
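Just to put numbers on that gap (taking the (log n)^2 figure above at face value), a quick back-of-the-envelope comparison:

```python
import math

# Compare the n^2 attention cost against a hypothetical (log2 n)^2 cost
# at a few context lengths.
for n in (1_000, 100_000, 10_000_000):
    quad = n * n
    poly_log = math.log2(n) ** 2
    print(f"n={n:>10,}: n^2 = {quad:.1e}, (log2 n)^2 = {poly_log:.1f}")
```

At n = 10 million tokens, n^2 is ~10^14 while (log2 n)^2 is only ~540, so the asymptotic savings are enormous if the latest-state-only assumption holds.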