The problem is you can't reliably get VMs on GCP.
All the major clouds suffer from this. On AWS you can't get an 80GB GPU without a long-term reservation, and even then it's wildly expensive. On GCP you sometimes can, but it's also insanely expensive.
These companies claim to be "startup friendly", but they are anything but. The neo-clouds (RunPod, Nebius, Lambda) somehow all manage to do this well, while the big clouds just milk enterprise customers who won't leave and screw over startups in the process.
This is a massive mistake on their part, and it will hurt their long-term growth significantly.
We've run into this a lot lately too, even on AWS. "Elastic" compute, but all the elasticity is gone. It's especially bitter since spreading the cost of spare capacity is the major benefit of scale here...
Agreed. Pricing is insane and availability generally sucks.
If anyone is curious about these neo-clouds, a YC startup called Shadeform has their availability and pricing in a live database here: https://www.shadeform.ai/instances
They have a platform where you can deploy VMs and bare metal from 20 or so popular providers like Lambda, Nebius, Scaleway, etc.
To massively increase the odds of getting GPUs, you can use something like SkyPilot (https://github.com/skypilot-org/skypilot) to fall back across regions, clouds, or GPU choices. E.g.,
$ sky launch --gpus H100
will fall back across GCP regions, AWS, your own clusters, etc. There are also options to say "try H100 or H200 or A100 or <insert>".
Essentially the way you deal with it is to increase the infra search space.
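For instance, a minimal task YAML sketch of that "try any of these GPUs" idea, based on SkyPilot's support for candidate accelerator sets (the file name, GPU counts, and run command here are just illustrative):

# task.yaml -- sketch: let SkyPilot pick whichever candidate GPU is available
resources:
  # Any one of these satisfies the request; SkyPilot searches across your
  # enabled clouds and regions for an available (and cheapest) option.
  accelerators: {H100:1, H200:1, A100-80GB:1}

run: |
  python train.py   # placeholder for the actual workload

# Then launch with:
# $ sky launch task.yaml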