The Qwen3.5 series is a bit of an exception to the general rule here: it is incredibly KV-cache efficient. I think the full max context (262k) fits in about 3 GB at Q8, iirc. I prefer to keep the cache at full precision, though.
I just tested it and have to make a correction: with llama.cpp, a 262,144-token context (Q8 cache) used 8.7 GB of memory with Qwen3.6 27B. Still very impressive.
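To put that 8.7 GB in perspective, here's a rough back-of-the-envelope sketch. The dense-GQA config in it (48 layers, 8 KV heads, head dim 128) is just a placeholder for a "typical" ~30B full-attention model, not Qwen's actual architecture (the real values are in the GGUF metadata), and I'm treating GB as 10^9 bytes; the point is only that a conventional full-attention model would need several times as much cache at this context length.

```python
# Rough numbers only. The dense-GQA config below is a placeholder for a
# "typical" ~30B model, NOT Qwen's actual architecture.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

n_ctx = 262_144
q8_bytes = 1.0625  # llama.cpp q8_0: 1 byte per value + an fp16 scale per 32-value block

# What a conventional dense GQA model of this size class would need:
typical = kv_cache_bytes(n_ctx, n_layers=48, n_kv_heads=8, head_dim=128,
                         bytes_per_elem=q8_bytes)
print(f"typical dense GQA @ 262k, q8: {typical / 1e9:.1f} GB")          # ~27 GB

# What the measurement in this thread works out to per token:
measured_gb = 8.7
print(f"measured: {measured_gb * 1e9 / n_ctx / 1024:.1f} KiB per token")  # ~32 KiB
```

For reference, the cache precision in llama.cpp is set with `--cache-type-k` / `--cache-type-v` (e.g. `q8_0` instead of the default `f16`), alongside the usual `-c` for context size.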