My guess is that autoregressive models can use key-value (KV) caching to eliminate most of the redundant FLOPs in the self-attention block: because attention is causal, the keys and values of earlier tokens never change, so they can be computed once and reused at every decoding step instead of being recomputed for the whole prefix. Diffusion models can't use KV caching, since their attention is bidirectional and every token's representation changes at each denoising step, so cached keys and values would go stale. They sell this as a win anyway, arguing that bidirectional attention leads to better reasoning.
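
To make the caching argument concrete, here is a minimal numpy sketch (toy weights, single head, no batching — all names are hypothetical, just for illustration): incremental decoding computes q, k, v only for the newest token, appends k and v to a cache, and still reproduces full causal attention exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension
T = 5  # sequence length
# Toy projection matrices (hypothetical, just for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((T, d))  # token embeddings

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(x):
    """Uncached causal attention: recomputes K and V for the whole prefix."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores[mask] = -np.inf  # causal mask: token t attends only to <= t
    return softmax(scores) @ V

def cached_attention(x):
    """Incremental decoding: per step, project only the newest token
    and reuse cached keys/values for everything before it."""
    K_cache, V_cache, outs = [], [], []
    for t in range(len(x)):
        q, k, v = x[t] @ Wq, x[t] @ Wk, x[t] @ Wv
        K_cache.append(k)  # earlier k, v never change under causal attention
        V_cache.append(v)
        K, V = np.stack(K_cache), np.stack(V_cache)
        w = softmax(q @ K.T / np.sqrt(d))
        outs.append(w @ V)
    return np.stack(outs)

# Cached and uncached paths agree, but the cached one does O(T) projection
# work per step instead of O(T) projections for the entire prefix.
assert np.allclose(full_attention(x), cached_attention(x))
```

In a bidirectional (diffusion-style) model the equivalent of `K_cache` would be invalidated every denoising step, since every token's embedding — and hence its key and value — changes each time.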