logoalt Hacker News

eduardogarzatoday at 4:23 PM4 repliesview on HN

Can someone that's worked at one of these big companies honestly explain how it happens that when these guys are down, it's never for like 10-15 mins ... it's always 1-2+ hours? Do they not have mechanisms in place to revert their migrations and deployments? What goes on behind the scenes during these "outages"?


Replies

aix1today at 4:36 PM

Part of it observability bias: longer, more widespread outages are more likely to draw signficant attention. This doesn't mean that there aren't also shorter, smaller-scope outages, it's just that we're much less likely to know about them.

For example, if there's a problem that gets caught at the 1% stage of a staged rollout, we're probably not going to find ourselves discussing it on HN.

jcfreitoday at 4:29 PM

Quick fixes have tendencies to break other stuff and just make matters worse. Better to leave it offline for a little longer, fix the definitive root issue and make sure it comes online nicely. If the issue was just a quirk in a recent deployment then these probably can be reverted easily on the endpoints where they were just deployed (I'm sure they are using staggered roll-outs). These long term downtime things are probably not issues related to a recent release.

Ocergetoday at 5:59 PM

You will run into thundering herd/hotspotting/pre-warmed caching issues when you have to restart. There's generally not an easy to way to switch these sorts of systems on and off, especially a relatively new system that isn't battle-hardened.

I got nothing for the github outages this year though, that seems like incompetence.

mrguyoramatoday at 7:33 PM

Well when the coding agents go down who are they supposed to ask what the problem is?

They should probably buy subscriptions to those Chinese agents.