Hacker News

lostmsu today at 1:10 PM

How large is the KV cache?


Replies

xbar today at 1:47 PM

0.1 GB per full-attention layer, and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." Only the full-attention layers hold a KV cache, so 15 × 0.1 GB ≈ 1.5 GB.
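
A quick back-of-the-envelope in Python. The per-layer formula (2 × kv_heads × head_dim × seq_len × dtype_bytes, one factor of 2 for K plus V) is the standard KV-cache size; the head count, head dim, context length, and dtype below are hypothetical values picked only so one layer lands near the quoted 0.1 GB:

    # Per the comment: only the 15 full-attention layers keep a KV cache;
    # the 45 GatedDeltaNet (linear attention) layers do not.
    num_full_attn_layers = 15

    # Hypothetical example values, not from the source:
    num_kv_heads = 2       # assumed GQA head count
    head_dim = 128         # assumed head dimension
    seq_len = 100_000      # assumed context length in tokens
    dtype_bytes = 2        # assumed bf16/fp16

    # K and V are each cached per token, per KV head, per layer.
    bytes_per_layer = 2 * num_kv_heads * head_dim * seq_len * dtype_bytes
    total_bytes = num_full_attn_layers * bytes_per_layer

    print(f"per layer: {bytes_per_layer / 1e9:.2f} GB")  # ~0.10 GB
    print(f"total:     {total_bytes / 1e9:.2f} GB")      # ~1.5 GB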