There are so many flags to llama.cpp that I won't try to say anything too strong, but I believe the flags related to KV offloading (`--no-kv-offload` and friends) mean you can have the KV cache in GPU VRAM, in regular system RAM, paged to disk, etc.
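If you're driving it through the llama-cpp-python bindings rather than the CLI, I believe the same switch shows up as the `offload_kqv` constructor argument (that's my assumption; check the docs for your version). A rough sketch:

```python
# Rough sketch, assuming llama-cpp-python's `offload_kqv` flag controls
# whether the KV cache lives in VRAM (model path below is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model.gguf",  # hypothetical path
    n_gpu_layers=-1,     # keep all weight layers resident on the GPU
    n_ctx=8192,          # context length is what drives KV cache size
    offload_kqv=False,   # keep the KV cache in system RAM instead of VRAM
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```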
I'm on a Mac with unified memory, so I can't easily benchmark this for you, but I think a PC with 64GB of system RAM and a 24GB gaming card could swap KV caches between multiple sessions without too much pain. The weights could stay resident on the GPU.
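For a rough sense of why that budget works out, here's the usual back-of-the-envelope KV cache math. The model geometry is an assumption I picked for illustration (roughly a 70B-class model with grouped-query attention), not measured numbers:

```python
# KV cache size: 2 (K and V) * layers * context * kv_heads * head_dim *
# bytes per element. Geometry below is an assumption for illustration.
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

size = kv_cache_bytes(n_layers=80, n_ctx=32768, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB per session")  # ~10 GiB at fp16
```

At that size a handful of full-context sessions fit in 64GB of system RAM while the weights stay put on the 24GB card.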
On the other hand, I did just dump some Project Gutenberg texts into a prompt, and building that cache in the first place was slower than I thought it would be.