> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one.
You can still use the cloud for excess capacity when needed: run the base load on-prem and spin up cloud instances for peaks in load.
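A rough sketch of that split, in case it helps. The function name, the 8-GPU instance size, and the numbers are all made up for illustration; a real scheduler would also have to account for spin-up time, data locality, and pricing.

```python
import math


def plan_capacity(demand_gpus: int, on_prem_gpus: int,
                  gpus_per_cloud_instance: int = 8) -> dict:
    """Fill on-prem capacity first, then burst the overflow to cloud.

    All parameters are hypothetical: on_prem_gpus is however many GPUs
    are currently healthy on-prem, and gpus_per_cloud_instance assumes
    cloud instances come in fixed-size blocks (8 here).
    """
    if demand_gpus < 0 or on_prem_gpus < 0 or gpus_per_cloud_instance <= 0:
        raise ValueError("demand and capacity must be non-negative, "
                         "instance size must be positive")

    # Base load: use whatever on-prem capacity is available.
    on_prem_used = min(demand_gpus, on_prem_gpus)

    # Peak load: anything on-prem cannot absorb goes to the cloud,
    # rounded up to whole instances.
    overflow = demand_gpus - on_prem_used
    cloud_instances = math.ceil(overflow / gpus_per_cloud_instance)

    return {
        "on_prem_used": on_prem_used,
        "overflow_gpus": overflow,
        "cloud_instances": cloud_instances,
    }


if __name__ == "__main__":
    print(plan_capacity(6, 8))    # normal day, everything fits on-prem
    print(plan_capacity(20, 8))   # peak, overflow goes to the cloud
    print(plan_capacity(8, 0))    # on-prem box is down, cloud takes it all
```

The last call is the flaky-server case from the parent comment: set the healthy on-prem count to zero and the same logic moves the whole job to cloud instances.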
This is my favorite use of the public cloud: the modern-day “hot site”. It’s way cheaper to pay reserved rates for failover instances of critical infra than to maintain a whole second site that sits unused, assuming your particular compliance or regulatory frameworks allow it. Especially in an era of remote work, it’s highly practical and cost-effective.