They were running a big kubernetes infrastructure to handle all of these RPC calls.
That takes a lot of engineer hours to set up and maintain. This architecture didn't just happen, it took a lot of FTE hours to get it working and keep it that way.
Yeah, the situation from TFA doesn't make a lot of sense; I was just highlighting that it's not as clear-cut as "costs > 1 FTE => fix it."
Kube is trivial to run. You hit a few switches on GKE/EKS and then a few simple configs. It doesn't take very many engineer hours to run. Infrastructure these days is trivial to operate. As an example, I run a datacenter cluster myself for a micro-SaaS in the process of SOC2 Type 2 compliance. The infra itself is pretty reliable. I had to run some power-kill sims before I traveled and it came back A+. With GKE/EKS this is even easier.
Over the years of running these I think the key is to keep the cluster config manual and then you just deploy your YAMLs from a repo with hydration of secrets or whatever.