
ben_s, yesterday at 8:02 AM

Once you oversubscribe GPU memory, performance usually collapses. Frameworks like vLLM can explicitly offload things like the KV cache to CPU memory, but that's an application-level tradeoff, not transparent GPU virtual memory.
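
To put rough numbers on "collapses", here is a back-of-the-envelope sketch. It assumes every pass touches the whole working set evenly, that whatever doesn't fit on the GPU gets pulled over the host link each pass, and it ignores page-fault and thrashing overhead (so real results are likely worse). The function name and the default bandwidth figures (2000 GB/s device memory, 32 GB/s host link) are illustrative assumptions, not measurements of any particular card.

```python
def effective_bandwidth_gbps(working_set_gb, gpu_mem_gb,
                             hbm_gbps=2000.0, pcie_gbps=32.0):
    """Crude estimate of achieved bandwidth for one full pass over the data.

    hbm_gbps and pcie_gbps are assumed, illustrative figures.
    """
    if working_set_gb <= 0:
        raise ValueError("working_set_gb must be positive")

    # Portion served from GPU memory vs. portion fetched over the host link.
    resident_gb = min(working_set_gb, gpu_mem_gb)
    spilled_gb = working_set_gb - resident_gb

    # Time for one pass is the sum of the fast part and the slow part.
    seconds = resident_gb / hbm_gbps + spilled_gb / pcie_gbps
    return working_set_gb / seconds


if __name__ == "__main__":
    # Hypothetical 80 GB GPU, growing working set.
    for ws in (60, 80, 88, 100, 160):
        bw = effective_bandwidth_gbps(ws, gpu_mem_gb=80)
        print(f"{ws:>4} GB working set -> ~{bw:,.0f} GB/s effective")
```

Under these assumptions, even a small spill dominates the pass time, because the slow link sits on the critical path. That is the reason an application-level choice about what to offload (say, cold KV cache blocks) tends to beat letting everything page on demand.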