Hacker News

Aurornis · yesterday at 10:58 PM · 0 replies

> This reads like you didn’t read the post.

I was discussing details I read in your repo, so how did you conclude that I didn't read the post? I'm skeptical a human is writing these comments, because everything you're posting reads like LLM output.

> On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context.

I'm sorry, but you're not the first (human or LLM) to think of using a Q4 KV cache to fit more context in VRAM.

The degradation is far more than 3% on real evals. A Q8 KV cache only recently became usable with Qwen3.5 in llama.cpp, after the context-rotation changes; before that, bf16 was necessary to get decent performance on real tasks.
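For anyone following along, the setting being debated maps to llama.cpp's KV-cache-type flags. A sketch of the two configurations (model path and context size are placeholders; exact flag spellings vary by llama.cpp version):

```shell
# Q4 KV cache: smallest VRAM footprint, the quality cost under debate here.
# Quantizing the V cache requires flash attention (-fa) in llama.cpp.
llama-server -m model.gguf -c 32768 -fa -ctk q4_0 -ctv q4_0

# Q8 KV cache: less memory savings than Q4, but much closer to bf16 quality.
llama-server -m model.gguf -c 32768 -fa -ctk q8_0 -ctv q8_0
```

The default (f16 cache, no `-ctk`/`-ctv`) is the baseline the quantized variants are traded off against.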

Q4 is a non-starter for real work. The fact that you're still trying to defend it tells me you haven't used this for anything other than token/sec racing.