CUDA has had managed memory that pages between VRAM and system RAM for a decade. The problem is that paging this way is unusably slow for AI workloads, so this seems like an unnecessary layer.
That slowness is almost useful. It makes the failure mode obvious instead of letting a 'transparent' layer hide it until some sloppy alloc or tensor blowup starts paging through system RAM or NVMe and the whole job turns into a smoke test for your storage stack.
For actual training, explicit sharding and RAM mapping are ugly, but at least you can see where the pressure is and reason about it. 'Transparent' often just means performance falls off a cliff and now debugging it sucks.
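To make the "you can see where the pressure is" point concrete, here is a minimal sketch in plain Python of explicit placement with a hard budget. It never touches a real GPU. The names `Pool`, `BudgetExceeded`, and `place`, and all the sizes, are hypothetical and chosen for illustration. The only idea it shows is that every buffer is assigned to a named pool up front, and an over-budget request fails loudly instead of quietly spilling somewhere slower.

```python
from dataclasses import dataclass, field

GIB = 1024 ** 3


class BudgetExceeded(RuntimeError):
    """Raised when a reservation does not fit in its assigned pool."""


@dataclass
class Pool:
    # Hypothetical stand-in for a memory tier such as VRAM or host RAM.
    name: str
    capacity: int
    used: int = 0
    items: dict = field(default_factory=dict)

    def reserve(self, label: str, nbytes: int) -> None:
        free = self.capacity - self.used
        if nbytes > free:
            # No fallback to another tier: the caller has to decide what moves.
            raise BudgetExceeded(
                f"{label}: needs {nbytes / GIB:.1f} GiB, "
                f"{self.name} has {free / GIB:.1f} GiB free"
            )
        self.used += nbytes
        self.items[label] = nbytes


def place(plan, pools):
    """Apply an explicit (label, nbytes, pool_name) plan; return usage per pool."""
    for label, nbytes, pool_name in plan:
        pools[pool_name].reserve(label, nbytes)
    return {name: (p.used, p.capacity) for name, p in pools.items()}


if __name__ == "__main__":
    # Illustrative sizes only, not measurements from any real job.
    pools = {"vram": Pool("vram", 24 * GIB), "host": Pool("host", 64 * GIB)}
    plan = [
        ("weights", 14 * GIB, "vram"),
        ("optimizer_state", 28 * GIB, "host"),
        ("activations", 8 * GIB, "vram"),
    ]
    for name, (used, cap) in place(plan, pools).items():
        print(f"{name}: {used / GIB:.0f} / {cap / GIB:.0f} GiB")
```

Ugly, yes, but when the next allocation does not fit you get an error that names the buffer and the pool, not a job that silently gets much slower.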