Feels 100% vibe coded in a bad way.
Llama.cpp already has KV cache quantization, and one of the turbo quant PRs will get merged at some point.
If you don’t care about the fancy 3-bit stuff, q8 KV cache quantization is good enough. Don’t bother with q4:
./build/bin/llama-server -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 65536
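For a sense of what q8_0 buys you, here is a back-of-envelope sizing sketch. The model shape is an assumption (a Llama-2-7B-like config: 32 layers, 32 KV heads, head dim 128); GGUF q8_0 stores blocks of 32 int8 values plus one f16 scale, i.e. 34 bytes per 32 elements:

```python
# Rough KV cache sizing: f16 vs q8_0.
# Assumed model shape (Llama-2-7B-ish); plug in your own numbers.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128
N_CTX = 65536  # matches -c 65536 above

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    elems = 2 * N_LAYERS * N_CTX * N_KV_HEADS * HEAD_DIM
    return elems * bytes_per_elem

f16 = kv_cache_bytes(2.0)        # f16: 2 bytes per element
q8_0 = kv_cache_bytes(34 / 32)   # q8_0: 1.0625 bytes per element

print(f"f16 : {f16 / 2**30:.1f} GiB")   # 32.0 GiB
print(f"q8_0: {q8_0 / 2**30:.1f} GiB")  # 17.0 GiB
```

So at a 64k context you roughly halve the cache footprint, which is usually the difference between fitting on the GPU or not.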
Etc