> If your database goes down at 3 AM, you need to fix it. Of all the places I've worked th...

ipsento606 • yesterday at 3:55 PM • 3 replies • view on HN

> If your database goes down at 3 AM, you need to fix it.

Of all the places I've worked that had the attitude "If this goes down at 3AM, we need to fix it immediately", there was only one where that was actually justifiable from a business perspective. I'm worked at plenty of places that had this attitude despite the fact that overnight traffic was minimal and nothing bad actually happened if a few clients had to wait until business hours for a fix.

I wonder if some of the preference for big-name cloud infrastructure comes from the fact that during an outage, employees can just say "AWS (or whatever) is having an outage, there's nothing we can do" vs. being expected to actually fix it

From this perspective, the ability to fix problems more quickly when self hosting could be considered an antifeature from the perspective of the employee getting woken up at 3am

Replies

laz • yesterday at 4:38 PM

The worst SEV calls are the one where you twiddle your thumbs waiting for a support rep to drop a crumb of information about the provider outage.

You wake up. It's not your fault. You're helpless to solve it.

➕ show 1 reply

jonahx • yesterday at 3:58 PM

This is also the basis for most SaaS purchases by large corporations. The old "Nobody gets fired for choosing IBM."

zbentley • yesterday at 4:09 PM

Really? That might be an anecdote sampled from unusually small businesses, then. Between myself and most peers I’ve ever talked to about availability, I heard an overwhelming majority of folks describe systems that really did need to be up 24/7 with high availability, and thus needed fast 24/7 incident response.

That includes big and small businesses, SaaS and non-SaaS, high scale (5M+rps) to tiny scale (100s-10krps), and all sorts of different markets and user bases. Even at the companies that were not staffed or providing a user service over night, overnight outages were immediately noticed because on average, more than one external integration/backfill/migration job was running at any time. Sure, “overnight on call” at small places like that was more “reports are hardcoded to email Bob if they hit an exception, and integration customers either know Bob’s phone number or how to ask their operations contact to call Bob”, but those are still environments where off-hours uptime and fast resolution of incidents was expected.

Between me, my colleagues, and friends/peers whose stories I know, that’s an N of high dozens to low hundreds.

What am I missing?

➕ show 1 reply

alt Hacker News

Replies