Not true for unified-memory systems. And on Strix Halo you have to dedicate a fixed amount up front, which is annoying.
You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM which is faster than the rest.
> You’re basically stating that swapping is also a bad idea.
Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.
Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main-memory pool appropriately on both Linux and Windows.
It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat: the standard CUDA hardware setup doesn't use CXL, so cache coherence isn't available and you're left with manual memory synchronization. Pretending that GPUs have a cache for system RAM when they don't is pretty suspect.
On some workloads, swapping is a bad idea.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no temporal locality. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because temporal locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
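To make the anti-locality point concrete, here's a toy sketch (names and sizes are made up for illustration): it simulates an LRU cache over the access pattern of repeated forward passes, where every "weight block" is read once per pass, in order. Any cache even one block smaller than the full model gets a 0% hit rate, which is exactly why caching/swapping heuristics built on temporal locality fail here.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, cache_size):
    """Simulate an LRU cache over a sequence of accesses; return hit fraction."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

# 100 weight blocks, read in order, over 10 forward passes.
accesses = list(range(100)) * 10

# A cache holding 99 of the 100 blocks still misses on every access:
# LRU always evicts the block that is needed next in the cycle.
print(lru_hit_rate(accesses, cache_size=99))   # 0.0
# Only a cache big enough for the entire model ever hits:
print(lru_hit_rate(accesses, cache_size=100))  # 0.9
```

The same logic applies to swapping: unless the fast tier holds all the weights, a cyclic sequential scan defeats any recency-based eviction policy.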
In the long term, compute is probably going to move towards the memory.