Hacker News

Aurornis — yesterday at 3:45 PM

> As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless

The 4-bit quants are far from lossless. The effects show up more on longer context problems.

> You can probably even go FP8 with 5090 (though there will be tradeoffs)

You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers.
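A quick back-of-envelope sketch of why 8-bit weights leave so little headroom on a 32 GB card (the bits-per-weight figures below are rough assumptions for the common GGUF quant formats, not measurements):

```python
# Rough VRAM budget for a dense ~27B model on a 32 GB card.
# Bits-per-weight values are approximate averages for each quant format.

def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for label, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    gib = weight_gib(27, bpw)
    print(f"{label}: ~{gib:.1f} GiB weights, ~{32 - gib:.1f} GiB left for KV cache + compute")
```

By this estimate Q8_0 weights alone take roughly 27 GiB, leaving only ~5 GiB for KV cache and compute buffers, which is why the context length you need matters so much here.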


Replies

zargon — yesterday at 9:02 PM

I just loaded up Qwen3.6 27B at Q8_0 quantization in llama.cpp, with 131072 context and Q8 kv cache:

  build/bin/llama-server \
    -m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
    --no-mmap \
    --n-gpu-layers all \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --jinja \
    --no-mmproj \
    --parallel 1 \
    --cache-ram 4096 -ctxcp 2 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking": true}'
Should fit nicely in a single 5090:

  self    model   context   compute
  30968 = 25972 +    4501 +     495
Even bumping up to a 16-bit K cache should fit comfortably by dropping down to 64K context, which is still a pretty decent amount. I would try both. I'm not sure how tolerant the Qwen3.6 series is of dropping the K cache to 8 bits.
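The context figure in that breakdown can be sanity-checked with a standard KV-cache size formula (the layer count, KV-head count, and head dimension below are hypothetical stand-ins for illustration, not the model's actual config):

```python
# Rough KV-cache size estimate for a transformer with grouped-query attention.
# Config numbers in the example call are assumptions, not real model values.

def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: float) -> float:
    # K and V each store ctx * n_kv_heads * head_dim values per layer.
    return 2 * ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem / 2**30

# Hypothetical config: 36 layers, 4 KV heads, head dim 128
print(kv_cache_gib(131072, 36, 4, 128, 1.0))   # q8_0-ish, ~1 byte/elem → 4.5
print(kv_cache_gib(131072, 36, 4, 128, 2.0))   # f16, 2 bytes/elem → 9.0
```

The takeaway is that moving from f16 to an 8-bit KV cache halves the context memory, which is what makes 128K context viable next to Q8_0 weights.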

ekojs — yesterday at 4:04 PM

> You cannot run these models at 8-bit on a 32GB card because you need space for context

You probably can, actually. I'm not saying it would be ideal, but the model can fit entirely in VRAM if you make sure to quantize the attention layers. KV-cache quantization and skipping the vision tower also help quite a bit. Long context would be tight, but it's very much possible.

I addressed the lossless claim in another reply, but I guess it really depends on what the model is used for. For my use cases, I'd say it's nearly lossless.

alex7o — yesterday at 6:01 PM

Turboquant at 4-bit also helps a lot with keeping context in VRAM, but int4 is definitely not lossless. It all depends, though; for some people it's sufficient.
