I am not an SRE, merely a sysadmin.
And somehow I have the impression that GPUs on Slurm/PBS could not be simpler.
You can use a VM for the head node; you don't really even need HA clustering for it, if you can accept taking 20 minutes to restore a VM. And the rest of the hardware is homogeneous: you set up one node right and the rest are identical.
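For a sense of what "set up one right" amounts to, here is a minimal sketch of the Slurm side (node names, GPU counts, and device paths below are made up for illustration):

    # gres.conf on each compute node (hypothetical 8x H200 boxes)
    NodeName=gpu[01-04] Name=gpu Type=h200 File=/dev/nvidia[0-7]

    # slurm.conf on the head node: declare the GPUs and a partition
    GresTypes=gpu
    NodeName=gpu[01-04] Gres=gpu:h200:8 State=UNKNOWN
    PartitionName=gpu Nodes=gpu[01-04] Default=YES State=UP

That's essentially it; the scheduler then tracks the GPUs as countable resources per node.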
And it's a cluster with a job queue, so one node going down is not the end of the world.
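From the user side it's just as boring; a toy batch script, assuming the partition and GRES names from the sketch above:

    #!/bin/bash
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:h200:2   # ask the scheduler for 2 GPUs on one node
    #SBATCH --time=04:00:00
    srun nvidia-smi             # lands on whichever node has free GPUs

If a node dies, pending jobs simply land elsewhere, and running jobs can be set to requeue.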
OK, if you have PCIe GPUs you sometimes have to re-seat them, and that's a pain. Otherwise, if your H200s or disks fail, you just replace them, under warranty or not.
That sounds way easier than how I've had to manage GPUs on-prem in the enterprise thus far (PCIe cards slotted into hypervisor boxes and shared via SR-IOV). I'll have to look into it, but I doubt it'll ever enter my personal wheelhouse, given how quickly GPU-based workloads are moved either to the cloud for effective utilization at scale, or onto custom accelerators for edge workloads/inference.