zozbot234 • yesterday at 6:46 PM

KV cache size correlates with attention parameters, which are a subset of active parameters. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count.
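A rough back-of-the-envelope sketch of why: assuming standard multi-head or grouped-query attention and an fp16 cache, KV size depends only on layer count, KV heads, head dimension and sequence length. The helper name `kv_cache_bytes` and both configs below are made up for illustration, not taken from any real model; the assumption is that the MoE model spends most of its total parameters on expert FFNs, so at equal total size it ends up with a smaller attention stack.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate per-sequence KV cache size in bytes.

    Assumes plain MHA/GQA attention: one key and one value vector per
    layer, per KV head, per token. bytes_per_elem=2 corresponds to fp16/bf16.
    Note that the number of experts does not appear anywhere in this formula.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configs with the same *total* parameter budget.
# The dense model spends its budget on a deep stack of attention + FFN layers;
# the MoE model spends most of it on expert FFNs and has fewer layers.
dense = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
moe = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"dense: {dense / 2**30:.2f} GiB per 4096-token sequence")
print(f"moe:   {moe / 2**30:.2f} GiB per 4096-token sequence")
```

With these invented numbers the MoE cache is 32/80 of the dense one. Real ratios depend entirely on the actual architectures being compared.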
