Hacker News

GreenGames · yesterday at 9:53 PM

[flagged]


Replies

Aurornis · yesterday at 10:58 PM

> This reads like you didn’t read the post.

I was discussing details I read in your repo. How did you conclude that I didn't read the post? I'm skeptical a human is writing these comments, because everything you're posting reads like LLM output.

> On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context.

I'm sorry, but you're not the first person (or LLM) to think of using a Q4 KV cache to fit more context in VRAM.

The degradation is far more than 3% on real evals. Q8 only recently became usable with Qwen3.5 in llama.cpp, after the context rotation changes; before that, bf16 was necessary to get decent performance on real tasks.

Q4 is a non-starter for real work. The fact that you're still trying to defend it tells me you haven't used this for anything other than token/sec racing.
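For context on why the VRAM argument is tempting in the first place, here is a back-of-envelope sketch of KV cache sizing. The formula is the standard one for a decoder-style transformer (2 tensors per layer, one K/V entry per token per KV head); the model dimensions below are hypothetical illustrations, not taken from the repo under discussion, and the 4-bit figure is idealized (it ignores quantization scale overhead).

```python
# Back-of-envelope KV cache sizing for a decoder-style transformer.
# Dims below are hypothetical (32 layers, 8 KV heads, head_dim 128).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # Factor of 2 covers the K and V tensors; one entry per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 32768
f16 = kv_cache_bytes(32, 8, 128, ctx, 2.0)   # 16-bit cache
q4  = kv_cache_bytes(32, 8, 128, ctx, 0.5)   # idealized 4-bit, no scale overhead

print(f16 / 2**30, q4 / 2**30)  # → 4.0 1.0  (GiB)
```

So an idealized Q4 cache fits roughly 4x the context of an f16 cache in the same memory, which is exactly the appeal; the dispute in this thread is whether the quality cost at Q4 is acceptable, not whether the memory math works.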

ohyoutravel · yesterday at 11:19 PM

This is an embarrassing reply. Unfortunately you’ve hit the hour mark so you cannot delete it. :(

refulgentis · yesterday at 10:23 PM

You wrote this reply with Claude, and it's lying about it only being README.md. OP and I know this because you and Claude documented it.*

I use the same tools, I'm not mad at you for using it. It's just, idk man, you want to use it tactically in ways that are a net benefit to you. Not in ways that embarrass you or lie.

* https://github.com/Luce-Org/lucebox-hub/commit/cfc38f67275ee...

* * Here's Claude's version of this very post if you want to see an example of Claude voice vs. original and how to spot it: https://gist.githubusercontent.com/jpohhhh/a42060f0f34339c4b...