> And there's no room for kv, so you'll OOM around 4k of context.
Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using a LRU policy, so the overhead would be tolerable.
The performance already isn't spectacular with it running all in vram. It'll obviously depend on the model: MoE will probably perform better than a dense model, and anything with reasoning is going to take _forever_ to even start beginning its actual output.
I know llama.cpp can, it certainly improved performance on my RAM-starved GPU.
Not worth it. It is a very significant performance hit.
With that said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCI bus with the high bandwidth layers like KV cache, you eliminate a lot of the performance benefit that you get from having fast memory near the GPU die.