Can someone that's worked at one of these big companies honestly explain how it happens that wh...

eduardogarza • today at 4:23 PM • 4 replies

Can someone that's worked at one of these big companies honestly explain how it happens that when these guys are down, it's never for like 10-15 mins ... it's always 1-2+ hours? Do they not have mechanisms in place to revert their migrations and deployments? What goes on behind the scenes during these "outages"?

Replies

aix1 • today at 4:36 PM

Part of it is observability bias: longer, more widespread outages are more likely to draw significant attention. This doesn't mean that there aren't also shorter, smaller-scope outages; it's just that we're much less likely to know about them.

For example, if there's a problem that gets caught at the 1% stage of a staged rollout, we're probably not going to find ourselves discussing it on HN.
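To make the "caught at the 1% stage" idea concrete, here is a minimal sketch of a staged rollout loop. Everything in it is hypothetical: the stage fractions, the `error_rate_at` health probe, and the 1% error threshold are illustrative assumptions, not how any particular company does it.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],  # hypothetical health probe
    max_error_rate: float = 0.01,             # assumed threshold
) -> Tuple[str, float]:
    """Walk through traffic fractions and stop at the first unhealthy stage."""
    reached = 0.0
    for fraction in stages:
        reached = fraction
        # Check health at this exposure level before widening further.
        if error_rate_at(fraction) > max_error_rate:
            return ("rolled_back", fraction)
    return ("completed", reached)


# A bad release is stopped while only 1% of traffic has seen it.
bad_release = staged_rollout([0.01, 0.10, 0.50, 1.00], lambda f: 0.20)
# A healthy release proceeds through every stage.
good_release = staged_rollout([0.01, 0.10, 0.50, 1.00], lambda f: 0.001)
```

In this sketch the bad release never gets past the first stage, which is the kind of incident that stays small enough that nobody outside the company hears about it.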

jcfrei • today at 4:29 PM

Quick fixes tend to break other things and make matters worse. Better to leave it offline a little longer, fix the actual root cause, and make sure it comes back online cleanly. If the issue was just a quirk in a recent deployment, it can probably be reverted easily on the endpoints where it was just deployed (I'm sure they are using staggered roll-outs). These long outages are probably not related to a recent release.

Ocerge • today at 5:59 PM

You will run into thundering herd/hotspotting/pre-warmed caching issues when you have to restart. There's generally not an easy way to switch these sorts of systems on and off, especially a relatively new system that isn't battle-hardened.
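One common client-side mitigation for the thundering herd part is retrying with exponential backoff plus jitter, so that clients don't all reconnect at the same instant when the service comes back. A rough sketch follows; the function name and the `base` and `cap` values are made up for illustration, and this is the "full jitter" variant rather than anything specific to the systems being discussed.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,   # assumed initial ceiling, in seconds
    cap: float = 30.0,   # assumed maximum ceiling, in seconds
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized wait time per retry attempt (full jitter)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it hits the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so clients spread out over time.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Seeded only so the example is repeatable.
delays = backoff_delays(6, base=0.5, cap=4.0, rng=random.Random(0))
```

This only spreads out the retries; it doesn't help with cold caches or hotspots, which still have to be warmed up or rebalanced on the server side.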

I got nothing for the GitHub outages this year though; that seems like incompetence.

mrguyorama • today at 7:33 PM

Well when the coding agents go down who are they supposed to ask what the problem is?

They should probably buy subscriptions to those Chinese agents.
