zkmon • today at 7:48 PM

On llama-server, the Q4_K_M quant gives me about 91K tokens of context on 24GB, which works out to about 70MB of KV cache per 1K tokens of context. I could have gone for Q5, which would probably leave room for only about 30K tokens. I think this is pretty impressive.
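
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It only uses the numbers quoted above (70MB per 1K tokens, 91K and 30K context, 24GB); the function name is made up for illustration, units are decimal (1GB = 1000MB), and compute buffers and other overhead are ignored, so treat the results as ballpark figures rather than measurements.

```python
# Back-of-the-envelope KV-cache arithmetic from the figures in the comment.
# Assumptions: decimal units (1 GB = 1000 MB), linear KV-cache growth with
# context length, and no allowance for compute buffers or other overhead.

MB_PER_1K_TOKENS = 70   # KV-cache cost quoted in the comment
MEMORY_GB = 24          # total memory quoted in the comment


def kv_cache_gb(context_tokens: int, mb_per_1k: float = MB_PER_1K_TOKENS) -> float:
    """Approximate KV-cache size in GB for a given context length."""
    return context_tokens / 1000 * mb_per_1k / 1000


q4_kv = kv_cache_gb(91_000)   # context reported with Q4_K_M
q5_kv = kv_cache_gb(30_000)   # context guessed for Q5

print(f"Q4_K_M: {q4_kv:.2f} GB KV cache, {MEMORY_GB - q4_kv:.2f} GB left for weights etc.")
print(f"Q5:     {q5_kv:.2f} GB KV cache, {MEMORY_GB - q5_kv:.2f} GB left for weights etc.")
```

Under these assumptions the 91K context costs roughly 6.4GB of KV cache, leaving about 17.6GB for the weights and everything else, while a 30K context would cost about 2.1GB.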

alt Hacker News