This is 128B dense though. The KV cache at long context is going to be massive.
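Don’t think KV cache size correlates with dense vs. MoE. It depends on layer count, KV heads, head dim, and context length, not on whether the FFN is dense or MoE.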
With TurboQuant, you would reduce it by over 6x.
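
For anyone who wants to sanity-check the numbers, here is a rough sketch of the usual KV cache arithmetic in Python. The config values (80 layers, 8 KV heads, head dim 128, 128K context, fp16) are made up for illustration and are not the specs of the model being discussed. The 6x factor is just the figure quoted above, applied as a plain division, not a measured result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size for standard (non-compressed) attention.

    The leading 2 accounts for storing both keys and values.
    Note there is no term for total parameter count or number of experts.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical config, chosen only to illustrate the calculation.
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 80, 8, 128, 131072

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=2)
fp16_gib = fp16_bytes / 2**30

# Assumed compression ratio taken from the claim in the thread.
ASSUMED_REDUCTION = 6
compressed_gib = fp16_gib / ASSUMED_REDUCTION

print(f"fp16 KV cache: {fp16_gib:.1f} GiB")
print(f"with an assumed {ASSUMED_REDUCTION}x reduction: {compressed_gib:.1f} GiB")
```

The point of writing it out: the formula only sees attention shape and context length, so a dense model and an MoE model with the same attention config would have the same KV cache size.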