yorwba • today at 11:45 AM

Multi-head Latent Attention (MLA) is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization (VQ) stores KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. Because MLA changes the model architecture, it needs to be part of the model from the beginning of training, whereas VQ only changes how cache entries are stored, so it can be retrofitted onto an already-trained model. The two are complementary, so you could also combine them.
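To make the "retrofit" point concrete, here is a rough sketch of compressing already-computed keys to a few bits per dimension and checking how much the attention weights move. Big caveat: this uses plain uniform scalar quantization as a stand-in, not an actual vector quantizer with a learned codebook, and the function names, dimensions, and random data are all made up for illustration.

```python
import math
import random

def quantize(vec, bits):
    """Uniform per-vector quantization (stand-in for real VQ, an assumption)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1                      # highest integer code
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Hypothetical sizes and random data, purely for illustration.
random.seed(0)
dim, n_keys, bits = 64, 16, 4
query = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]

# Quantize the cached keys after the fact; the "model" is left untouched.
keys_q = [dequantize(*quantize(k, bits)) for k in keys]

exact = attention_weights(query, keys)
approx = attention_weights(query, keys_q)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
print(f"max change in attention weight at {bits} bits/dim: {max_err:.4f}")
```

Nothing here touches the weights that produced the keys, which is why this kind of compression can be bolted on after training. MLA, by contrast, would change what `dim` is in the first place, so the two savings would multiply if combined.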
