> Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
This is a very interesting paper, and it will hopefully help address a chronic problem in AI: the lack of efficient methods for reining in computational and energy demands that are currently off the charts.
> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.
It's still early and the work has several limitations, but spending billions on GPUs may soon stop making sense.