I would suggest using both on-premise hardware and cloud computing, which is probably what comma is doing.
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room at the headquarters is one thing, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and buying your own hardware can pay for itself in a matter of months. After that point, who cares if the server room is a total loss: worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud (see the sketch below).
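For a concrete (if simplified) picture of what that stopgap could look like, here is a minimal sketch in Python with boto3 rather than Terraform; the region, AMI id, instance type, and key pair name are all hypothetical placeholders, not anything from the comment above.

```python
# Minimal sketch: spin up a temporary cloud instance as a stopgap replacement
# for a lost on-prem box. Uses boto3 instead of Terraform purely for
# illustration; all identifiers below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI baked with your stack
    InstanceType="g5.2xlarge",         # pick whatever roughly matches the lost hardware
    MinCount=1,
    MaxCount=1,
    KeyName="emergency-key",           # hypothetical key pair
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "temporary-onprem-replacement"}],
    }],
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched temporary replacement: {instance_id}")
```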
Another option in between is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.
I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.
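To give a feel for what "having to use Slurm" means day to day, here is a minimal sketch of a batch job; the partition name and resource numbers are hypothetical. Slurm accepts Python as the batch interpreter, so the #SBATCH lines below are ordinary comments to Python but directives to sbatch.

```python
#!/usr/bin/env python3
#SBATCH --job-name=toy-train        # job name shown in squeue
#SBATCH --partition=accel           # hypothetical GPU partition name
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00             # wall-clock limit, HH:MM:SS
#
# Submit with:  sbatch this_script.py
import os
import socket

print(f"Running job {os.environ.get('SLURM_JOB_ID')} on {socket.gethostname()}")
# ... your actual training code goes here ...
```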
I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. You can also set up research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
> I would rather pay a competent cloud provider than being responsible for reliability issues.
Why do so many developers and sysadmins think they're not competent enough to host services themselves? It is a lot easier than you think, and it's also fun to solve the technical issues you run into.
> Maintaining one server room at the headquarters is one thing, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
Unfortunately we experienced an issue where our Slurm pool was contaminated by a misconstrained Postgres Daemon. Normally the contaminated slurm pool would drain into a docker container, but due to Rust it overloaded and the daemon ate its own head. Eventually we returned it to a restful state so all's well that ends well.
(hardware engineer trying to understand wtaf software people are saying when they speak)
> but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO
I worked at a company with two server farms (essentially a main one and a backup) located in two different regions of Italy, and we had a total of 5 employees taking care of them.
We never heard from them, we didn't even know their names, but we had almost 100% uptime and terrific performance.
There was a single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company 800k euros per year to run both server farms (hardware, salaries, energy), and it spared the company around 7-8M in cloud costs.
Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and I think they employ around 15+ DevOps engineers.