The best-proven linear attention is Gated DeltaNet, or variations of it, as used by Kimi and Qwen. Anyone who thinks linear attention can't work is forgetting that models are a fixed size, so the attention state should always be compressible into something linear. Another way to see why linear attention is feasible: standard attention becomes linear simply by removing the softmax. Without it, matrix multiplication is associative, so the computation can be reordered and the KV cache collapses into a running sum of key-value outer products, one constant-size matrix, instead of storing every key and value (see the sketch below). Softmax just normalizes the attention weights; it isn't theoretically required.
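
A minimal sketch of that point, with assumed names and shapes (NumPy, causal masking, no scaling): the same softmax-free attention computed two ways, once with the full T×T score matrix and once recurrently with a single d_k×d_v state standing in for the KV cache. The two agree exactly.

```python
import numpy as np

def quadratic_no_softmax(Q, K, V):
    """Reference: full T x T score matrix, causal mask, no softmax."""
    T = Q.shape[0]
    scores = Q @ K.T                      # (T, T) raw dot-product scores
    mask = np.tril(np.ones((T, T)))       # causal mask
    return (scores * mask) @ V            # (T, d_v)

def linear_no_softmax(Q, K, V):
    """Same result computed recurrently with an O(d_k * d_v) state."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))              # constant-size state replaces the KV cache
    out = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S += np.outer(K[t], V[t])         # accumulate k_t outer v_t
        out[t] = Q[t] @ S                 # read out for the current query
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(quadratic_no_softmax(Q, K, V), linear_no_softmax(Q, K, V))
```

Gated DeltaNet and its relatives keep this same constant-size state but add a learned decay (gate) and a delta-rule update so the state can forget and overwrite, rather than just accumulating.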