Would it be possible to implement "virtual memory" for a GPU this way? Let's say you have GPUs at 30% utilization but they're memory-limited. Could you run 2 workloads by offloading the GPU memory when not in use?
Once you oversubscribe GPU memory, performance usually collapses: host links (PCIe, or even NVLink-to-CPU) are one to two orders of magnitude slower than on-package HBM, so demand-paging working sets back and forth just thrashes. Frameworks like vLLM can explicitly offload things like the KV cache to CPU memory, but that's an application-level tradeoff, not transparent GPU virtual memory.
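For concreteness, a minimal sketch of that application-level approach, assuming a recent vLLM release where `gpu_memory_utilization`, `swap_space`, and `cpu_offload_gb` are accepted engine arguments (names and availability vary by version); the model name is just a placeholder:

```python
# Hedged sketch, not a drop-in recipe: parameter names are version-dependent,
# so check the EngineArgs of your vLLM release before relying on them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.30,  # cap this workload's share of GPU memory,
                                  # leaving room for a co-located job
    swap_space=8,                 # GiB of CPU RAM for swapping preempted KV-cache blocks
    cpu_offload_gb=4,             # GiB of weights kept in CPU RAM, streamed in on demand
)

out = llm.generate(
    ["Explain GPU memory oversubscription in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```

Even configured this way, every offloaded block still has to cross the host link when it's needed, so this buys capacity at the cost of latency and throughput; it doesn't make the second workload free.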