Hacker News

tgrowazay, yesterday at 9:23 PM

Self-hosted 8xH100 is ~$250k, depreciated across three years => $80k/year, with power and cooling => $90k/year (~$10/hour total).

AWS charges $55/hour for an EC2 p5.48xlarge instance, which goes down with 1- or 3-year commitments.

With a 1-year commitment, it costs ~$30/hour => $262k per year.

A 3-year commitment brings the price down to $24/hour => $210k per year.

This price does NOT include egress and other fees.

So, yeah, there is a $120k-$175k difference that can pay for a full-time on-site SRE, even if you only need one 8xH100 server.

Numbers get better if you need more than one server like that.
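
The arithmetic above can be reproduced with a minimal Python sketch. It assumes the server is billed for every hour of the year (8,760 hours) and takes the quoted figures at face value; the names `annual_cost` and `yearly_gap` are made up for illustration, and egress and other fees are left out, as in the comment.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the server runs around the clock

# Figures quoted in the comment (USD); treat all of them as rough estimates.
SELF_HOSTED_PER_YEAR = 90_000  # hardware depreciation plus power and cooling
CLOUD_HOURLY = {"on-demand": 55, "1-year commitment": 30, "3-year commitment": 24}


def annual_cost(hourly_rate: float) -> float:
    """Yearly cost of an instance billed at hourly_rate for every hour of the year."""
    return hourly_rate * HOURS_PER_YEAR


def yearly_gap(hourly_rate: float, self_hosted: float = SELF_HOSTED_PER_YEAR) -> float:
    """How much more the cloud option costs per year than self-hosting."""
    return annual_cost(hourly_rate) - self_hosted


# Effective hourly rate of the self-hosted server, for comparison with cloud pricing.
print(f"self-hosted: ${SELF_HOSTED_PER_YEAR / HOURS_PER_YEAR:.2f}/hour")
for plan, rate in CLOUD_HOURLY.items():
    print(f"{plan}: ${annual_cost(rate):,.0f}/year, gap ${yearly_gap(rate):,.0f}/year")
```

Under these assumptions the gap works out to roughly $120k per year against the 3-year rate and roughly $173k against the 1-year rate, in line with the range quoted above.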


Replies

Aurornis, yesterday at 9:33 PM

$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

If there's an issue with the server while they're sick or on vacation, you just stop and wait.

If they take a new job, you need to find someone to take over or very quickly hire a replacement.

There's a second bus factor: What happens when that 8xH100 starts to get flaky? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope that gets to the root issue, but that's more downtime.

Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.
