The root comment asked whether I'd been part of an org that scaled by orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4-month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So, with the credentialing question out of the way...
GitHub's April post-mortem names the causes in its own words: tight coupling that let localised failures cascade, and an inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory; it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
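To be concrete about how basic the load-shedding idea is: cap in-flight work and fail fast past the cap, instead of queueing until everything degrades together. Here's a minimal sketch in Python (names and the `LoadShedder` class are illustrative, nothing to do with GitHub's actual internals):

```python
import threading
import time

class LoadShedder:
    """Reject work beyond a fixed concurrency limit instead of queueing it."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.Semaphore(max_in_flight)

    def handle(self, request_fn):
        # Non-blocking acquire: if every slot is taken, shed the request
        # immediately (fail fast) rather than letting a backlog build up
        # and drag every other client down with it.
        if not self._slots.acquire(blocking=False):
            return ("shed", None)  # caller maps this to e.g. HTTP 503
        try:
            return ("ok", request_fn())
        finally:
            self._slots.release()

shedder = LoadShedder(max_in_flight=1)
results = []

def slow_request():
    time.sleep(0.2)
    return "done"

# One slow request occupies the only slot...
t = threading.Thread(target=lambda: results.append(shedder.handle(slow_request)))
t.start()
time.sleep(0.05)

# ...so a request arriving meanwhile is shed, not queued.
status, _ = shedder.handle(lambda: "fast")
t.join()
print(status)          # shed
print(results[0][0])   # ok
```

Real systems layer priorities, deadlines and per-client quotas on top, but the core discipline is exactly this small.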
Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Vessel of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.
Your 100x rearchitecture story actually argues for my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is that GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework.
So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from the inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.