GaggiX • yesterday at 4:57 PM

The model uses Gated DeltaNet and Gated Attention, so the memory usage of the KV cache is very low, even at BF16 precision.
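To put rough numbers on that, here is a back-of-the-envelope sketch. Only the full-attention layers keep a KV cache that grows with context length; linear-attention layers such as Gated DeltaNet carry a fixed-size recurrent state instead, which is not counted here. Every figure below (layer counts, KV heads, head dimension, context length) is a made-up assumption for illustration, not the actual configuration of the model being discussed.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a per-sequence KV cache in bytes.

    Each full-attention layer stores one key and one value vector of
    num_kv_heads * head_dim elements per token. bytes_per_elem=2 is BF16.
    """
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only to illustrate the arithmetic.
TOTAL_LAYERS = 48       # assumed total number of layers
FULL_ATTN_LAYERS = 12   # assumed: 1 in 4 layers uses full (gated) attention
NUM_KV_HEADS = 2        # assumed number of KV heads
HEAD_DIM = 256          # assumed head dimension
SEQ_LEN = 32_768        # assumed context length in tokens

# Hybrid stack: only the full-attention layers contribute to the KV cache.
hybrid = kv_cache_bytes(FULL_ATTN_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

# Same shape, but with every layer using full attention.
dense = kv_cache_bytes(TOTAL_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"hybrid: {hybrid / 2**20:.0f} MiB")
print(f"dense:  {dense / 2**20:.0f} MiB")
print(f"ratio:  {dense / hybrid:.1f}x")
```

Under these assumptions the cache shrinks in proportion to the share of layers that still use full attention, which is why it stays small without quantizing below BF16.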
