> With that said, people are trying to extend VRAM into system RAM or even NVMe storage
That's only useful for prefill, at least on the usual discrete-GPU setup, because for decode the PCIe bus becomes a severe bottleneck as soon as you offload more than a tiny fraction of the memory workload to system RAM or NVMe. (iGPU/APU/unified-memory systems are different and can basically be treated as VRAM-only, just a bit slower.) For decode, you're better off running entire layers, including expert layers, on the CPU, which local AI frameworks support out of the box. CPU-run layers can in turn fall back to storage for model parameters or KV cache as a last resort, but if you offload too much to storage (i.e. there isn't enough RAM to cache it), that dominates the overhead and basically everything else becomes irrelevant.
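
To put rough numbers on the decode case, here's a back-of-envelope sketch. It assumes decode is purely bandwidth-bound (every weight is read once per token) and ignores compute, latency, and overlap. All the figures are illustrative assumptions, not measurements: roughly 1000 GB/s for VRAM, 32 GB/s for a PCIe 4.0 x16 link, 80 GB/s for dual-channel system RAM, and a 40 GB model.

```python
def decode_time_per_token(weights_gb, offload_frac, vram_gbps, slow_gbps):
    """Rough seconds per token if decode is purely bandwidth-bound.

    offload_frac of the weights is read through the slow path (PCIe or
    system RAM); the rest is read from VRAM. Compute time is ignored.
    """
    on_gpu = weights_gb * (1.0 - offload_frac) / vram_gbps
    off_gpu = weights_gb * offload_frac / slow_gbps
    return on_gpu + off_gpu

# Illustrative, assumed numbers (not measurements).
VRAM_GBPS = 1000.0   # high-end discrete GPU memory bandwidth
PCIE_GBPS = 32.0     # PCIe 4.0 x16, theoretical peak per direction
RAM_GBPS = 80.0      # dual-channel system RAM as seen by the CPU
WEIGHTS_GB = 40.0    # weights touched per generated token

for frac in (0.0, 0.05, 0.25, 0.5):
    # Option A: GPU streams the offloaded weights over PCIe every token.
    stream = decode_time_per_token(WEIGHTS_GB, frac, VRAM_GBPS, PCIE_GBPS)
    # Option B: the offloaded layers run on the CPU, reading from RAM.
    cpu = decode_time_per_token(WEIGHTS_GB, frac, VRAM_GBPS, RAM_GBPS)
    print(f"offload {frac:4.0%}: stream over PCIe {1 / stream:5.1f} tok/s, "
          f"CPU-run layers {1 / cpu:5.1f} tok/s")
```

Under these assumptions, even a 5% offload streamed over PCIe costs more time per token than the other 95% sitting in VRAM, and running the offloaded layers on the CPU comes out ahead simply because RAM bandwidth is higher than the PCIe link. Prefill is different because one pass over the weights is amortized across many prompt tokens.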