
keeda · yesterday at 5:20 PM

It really, really depends on what you mean. Specifically, it depends on the application and its various compute, I/O and access patterns. Scaling ecommerce and games is well-known by now (Amazon and Blizzard, for example, have been dealing with insane scale for two decades). However, anything outside a well-known pattern can be very tricky to scale.

I once worked on a team that had to scale a system 100x. Its downstream dependencies were various 3rd-party APIs and data sources, most of which had no real SLAs to speak of and extremely high variance in latencies and data transfer patterns. This basically required rearchitecting everything, including our clients, because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call-tree and caused outages 3+ services removed from ours. In some cases, the rearchitecting went all the way to the UI.
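To make the decoupling concrete, here is a minimal sketch in Go of the general pattern: a hard per-call deadline plus a simple circuit breaker in front of the flaky dependency, so a slow API fails fast instead of holding the caller's request open. The names and thresholds are invented for illustration, not taken from that system:

    // Hypothetical sketch: put a hard deadline and a circuit breaker in front of a
    // flaky third-party API so its latency spikes fail fast instead of holding the
    // caller's request open and rippling up the call tree.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "sync"
        "time"
    )

    type breaker struct {
        mu        sync.Mutex
        failures  int
        openUntil time.Time
    }

    var errShedding = errors.New("dependency circuit open, serve fallback")

    // call runs fn under a 500ms deadline; after 5 consecutive failures the breaker
    // opens for 30s and callers get an immediate error instead of a slow one.
    func (b *breaker) call(ctx context.Context, fn func(context.Context) error) error {
        b.mu.Lock()
        open := time.Now().Before(b.openUntil)
        b.mu.Unlock()
        if open {
            return errShedding
        }

        ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
        defer cancel()
        err := fn(ctx)

        b.mu.Lock()
        defer b.mu.Unlock()
        if err != nil {
            b.failures++
            if b.failures >= 5 {
                b.openUntil = time.Now().Add(30 * time.Second)
                b.failures = 0
            }
            return err
        }
        b.failures = 0
        return nil
    }

    func main() {
        b := &breaker{}
        err := b.call(context.Background(), func(ctx context.Context) error {
            // Stand-in for the real third-party call; a real HTTP request would be
            // issued with this ctx so it is cancelled when the deadline expires.
            select {
            case <-time.After(2 * time.Second): // simulate a slow dependency
                return nil
            case <-ctx.Done():
                return ctx.Err()
            }
        })
        fmt.Println("result:", err) // fails after 500ms, not after 2s
    }

The actual fix went much further than this (async flows and, as noted, changes all the way up to the UI), but this is the failure-isolation primitive that keeps one slow dependency from stalling callers three services away.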

Years later, I led a company-wide effort to keep our entire user-facing application infrastructure from falling over under sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was no single common recommendation (like "tune your caches") we could give that would help all the teams, because each one had very different resource usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into, some common issues to look for, and some common solutions; get back to us if you need help." We spent a lot of time on the help.

I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB, and it has a dependency (Actions) with extremely high variance in latencies and resource usage.


Replies

keeda · yesterday at 8:28 PM

I wrote this in response to the comment below, which is now edited and unfortunately dead, so I'm posting it here:

I understand; that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!

Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.

It is really hard to predict how things will break until they do.

(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish them themselves. Many people researched their products and published their findings. For instance, someone analyzed Wireshark traces and published a very detailed report on how they tuned their server-side networking stack. One thing that stood out was that Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)

dijit · yesterday at 8:14 PM

The root comment asked if I'd been part of an org scaling orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4-month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So, with the credentialing question out of the way...

GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
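To show how small the load-shedding half of that syllabus is, here is a minimal sketch in Go of per-client admission control at the edge. The limit, header, and handler names are hypothetical and have nothing to do with GitHub's actual stack; the point is only the mechanism of rejecting a misbehaving client's overflow instead of queueing it:

    // Hypothetical sketch of load shedding at the edge: cap in-flight requests per
    // client key and reject the overflow immediately with a 429 rather than letting
    // a misbehaving client queue work behind everyone else's.
    package main

    import (
        "log"
        "net/http"
        "sync"
    )

    type shedder struct {
        mu       sync.Mutex
        inflight map[string]int
        limit    int
    }

    func newShedder(limit int) *shedder {
        return &shedder{inflight: make(map[string]int), limit: limit}
    }

    func (s *shedder) middleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // A real system would key on API token or client IP; a header stands in here.
            key := r.Header.Get("X-Client-ID")

            s.mu.Lock()
            if s.inflight[key] >= s.limit {
                s.mu.Unlock()
                w.Header().Set("Retry-After", "1")
                http.Error(w, "shedding load for this client", http.StatusTooManyRequests)
                return
            }
            s.inflight[key]++
            s.mu.Unlock()
            defer func() {
                s.mu.Lock()
                s.inflight[key]--
                s.mu.Unlock()
            }()

            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        // At most 20 concurrent requests per client; everything beyond that is shed.
        log.Fatal(http.ListenAndServe(":8080", newShedder(20).middleware(mux)))
    }

Real shedders layer on priorities and adaptive limits, but the core mechanism is about this small.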

Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.

Your 100x rearchitecture story actually argues for my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is that GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed-systems coursework.

So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.