zozbot234 • yesterday at 9:13 PM • 3 replies

> And there's no room for kv, so you'll OOM around 4k of context.

Can't you offload KV to system RAM, or even storage? That would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable.
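
To make the LRU idea concrete, here is a minimal sketch of a cache that keeps a fixed number of KV blocks "in VRAM" and fetches the rest from a slower store on demand. Everything in it is an assumption for illustration: the class name `LRUBlockCache`, the block granularity, and the plain dict standing in for system RAM. It is not how any particular framework implements this.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy LRU cache of KV blocks (hypothetical, for illustration only)."""

    def __init__(self, capacity, host_store):
        self.capacity = capacity      # max blocks resident in "VRAM"
        self.host_store = host_store  # dict standing in for system RAM
        self.vram = OrderedDict()     # resident blocks, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.vram:
            # Hit: mark the block as most recently used.
            self.vram.move_to_end(block_id)
            self.hits += 1
            return self.vram[block_id]
        # Miss: this is where the slow host-to-GPU transfer would happen.
        self.misses += 1
        block = self.host_store[block_id]
        self.vram[block_id] = block
        if len(self.vram) > self.capacity:
            # Evict the least recently used block.
            self.vram.popitem(last=False)
        return block


if __name__ == "__main__":
    host = {i: f"kv-block-{i}" for i in range(4)}
    cache = LRUBlockCache(capacity=2, host_store=host)
    for block_id in [0, 1, 0, 2, 1]:
        cache.get(block_id)
    print(cache.hits, cache.misses, list(cache.vram))
```

How tolerable the overhead is depends on the access pattern. If each decoding step reads every block in order and the blocks don't all fit, an LRU policy evicts each block just before it is needed again, so every access is a miss.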

Replies

tcdent • yesterday at 9:17 PM

Not worth it. It is a very significant performance hit.

That said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCIe bus with high-bandwidth data like the KV cache, you lose much of the performance benefit of having fast memory near the GPU die.
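
A rough back-of-envelope version of that argument, as a sketch. The model dimensions and both bandwidth figures below are assumptions chosen as round numbers, not measurements of any specific model or card, and the estimate assumes the whole KV cache has to cross the bus once per generated token.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # One K vector and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_len, **dims):
    return context_len * kv_bytes_per_token(**dims)


def read_time_ms(num_bytes, bandwidth_bytes_per_s):
    # Time to stream num_bytes once at the given bandwidth.
    return num_bytes / bandwidth_bytes_per_s * 1e3


# Hypothetical model: 32 layers, 8 KV heads, head dim 128, fp16 cache.
DIMS = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Assumed round figures: ~32 GB/s for a PCIe 4.0 x16 link,
# ~1000 GB/s for on-card VRAM.
PCIE_BW = 32e9
VRAM_BW = 1000e9

if __name__ == "__main__":
    total = kv_cache_bytes(4096, **DIMS)
    print(f"KV cache at 4k context: {total / 2**20:.0f} MiB")
    print(f"per-token read over PCIe: {read_time_ms(total, PCIE_BW):.2f} ms")
    print(f"per-token read from VRAM: {read_time_ms(total, VRAM_BW):.2f} ms")
```

Under these assumptions the cache is 128 KiB per token, so 512 MiB at 4k context, and reading it over the bus takes about 31 times longer than reading it from VRAM. The gap grows linearly with context length.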

bastawhiz • today at 1:10 AM

The performance already isn't spectacular with everything running in VRAM. It'll obviously depend on the model: MoE will probably perform better than a dense model, and anything with reasoning is going to take _forever_ to even begin its actual output.

ranger_danger • yesterday at 9:19 PM

I know llama.cpp can; it certainly improved performance on my RAM-starved GPU.
