
petu | today at 8:06 PM

Try 26B first. 31B seems to have a very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K of context takes up 4.9GB).

edit: the 31B cache is not bugged; there's a static SWA cost of 3.6GB. So IQ4_XS at 15.2GB seems like a reasonable pairing, but even then it's barely enough for 64K context on 24GB VRAM. Maybe 8-bit KV quantization is fine now that https://github.com/ggml-org/llama.cpp/pull/21038 got merged, in which case 100K+ is possible.
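For a rough sense of where numbers like 4.9GB come from: full-precision (fp16) KV cache grows linearly with context, at 2 (K and V) × layers × kv_heads × head_dim × 2 bytes per token. A minimal back-of-the-envelope sketch — the 48-layer / 8-KV-head / 128-dim config below is a hypothetical GQA shape, not the actual 31B architecture, and it ignores per-block quant overhead and the static SWA cost mentioned above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    """Approximate KV cache size: K and V tensors per layer, per token.

    bytes_per_elt: 2 for fp16, ~1 for q8_0 (ignoring block-scale overhead).
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical 48-layer GQA model, 16K context, fp16 cache:
fp16 = kv_cache_bytes(48, 8, 128, 16384)          # 3,221,225,472 B = 3.0 GiB
q8   = kv_cache_bytes(48, 8, 128, 16384, 1)       # roughly half of that
print(f"fp16: {fp16 / 2**30:.1f} GiB, ~q8_0: {q8 / 2**30:.1f} GiB")
```

The point being: 8-bit KV roughly halves the per-token cost, which is what makes 100K+ context plausible in 24GB.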

> Should I pick a full-precision smaller model or a 4-bit larger model?

4-bit larger model. You have to use a quant either way -- even if by "full precision" you mean 8-bit, it's gonna be 26GB + overhead + chat context.

Try UD-Q4_K_XL.
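Putting the pieces together, something like the following llama-server invocation would pair that quant with an 8-bit KV cache. This is a sketch under assumptions: the model filename is a placeholder, and the flags reflect recent llama.cpp builds (quantized V cache requires flash attention, `-fa`):

```shell
# Sketch: 4-bit weights (UD-Q4_K_XL) + q8_0 KV cache for long context.
# Model path is a placeholder; adjust -c to what fits your VRAM.
llama-server \
  -m ./model-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 102400 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```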


Replies

danielhanchen | today at 8:12 PM

Yes UD-Q4_K_XL works well! :)
