Hacker News

Havoc, yesterday at 6:30 PM

I don't think KV cache size correlates with whether a model is dense or MoE.


Replies

zozbot234, yesterday at 6:46 PM

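KV cache size correlates with the attention parameters, which are a subset of the active parameters. The expert FFN weights that make up most of an MoE model's total parameter count contribute nothing to the cache. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count.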
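
To put rough numbers on that, here is a minimal sketch. It assumes plain multi-head or grouped-query attention with a full-length cache (no sliding window, no latent-attention compression, no cache quantization). The function name `kv_cache_bytes` and both configs are hypothetical, picked only to illustrate the arithmetic, and do not describe any real model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard MHA/GQA attention.

    The leading 2 accounts for one K and one V tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    Note that nothing here depends on FFN or expert parameters.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configs (assumptions, not real models). Imagine both have
# a similar TOTAL parameter count, with the MoE spending most of its
# parameters on expert FFNs and using a shallower attention stack.
dense_cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)
moe_cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

seq_len = 4096
dense_bytes = kv_cache_bytes(seq_len=seq_len, **dense_cfg)
moe_bytes = kv_cache_bytes(seq_len=seq_len, **moe_cfg)

print(f"dense: {dense_bytes / 2**30:.2f} GiB")
print(f"moe:   {moe_bytes / 2**30:.2f} GiB")
```

The formula only sees layer count, KV heads, head dimension, and context length, so the gap comes entirely from how the attention stack is sized, not from MoE routing itself. Two models with identical attention configs would have identical KV caches regardless of how many experts they carry.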