> Serious question, have you been part of an org that had to scale orders of magnitude very quickly?
I have, but it depends what you mean.
Scenario 1: e-commerce SaaS (think: Amazon but whitelabel, and before CPUs even had AES instructions); Christmas was "fun".
Scenario 2: Video Games. The first day is the worst day when it comes to scale. Everything has to be flawless from day 0 and you get no warning as to what can go wrong.
Yet, somehow, I managed to make highly reliable systems.
In scenario 1, I had an existing system that had to scale up and down with load. This was before the cloud existed and hardware had a 3-4 month lead time, so most of the effort went into optimising existing code, increasing job timeouts, and "quenching" sources that were expensive. We also used to do some 'magic' when serving requests that carried a session token or shopping cart cookie.
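Roughly, the 'magic' was sticky routing: anonymous traffic could hit any cacheable frontend, while anything carrying a session or cart cookie got pinned to a stateful backend. A minimal sketch of the idea in Python (pool names hypothetical, nothing like our actual stack):

    # Sketch only: pin stateful sessions to a backend, let anonymous
    # traffic round-robin across a cheap, cacheable pool.
    import hashlib

    CACHEABLE_POOL = ["web-cache-1", "web-cache-2", "web-cache-3"]
    STATEFUL_POOL = ["app-1", "app-2"]  # hypothetical names

    def pick_backend(cookies: dict, request_id: int) -> str:
        token = cookies.get("session") or cookies.get("cart")
        if token is None:
            # Stateless request: any cacheable frontend will do.
            return CACHEABLE_POOL[request_id % len(CACHEABLE_POOL)]
        # Stateful request: hash the token so the same session always
        # lands on the same backend and finds its warm local state.
        digest = hashlib.sha1(token.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(STATEFUL_POOL)
        return STATEFUL_POOL[index]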
In scenario 2, we had a clean-room implementation and no legacy, which is a blessing but also a curse: there's no way to sample real usage, but you also don't need to worry about making breaking changes that are for the better. With legacy, you have to figure out how to migrate to the new behaviour gradually.
So, pros and cons... but it's not like handling huge load hasn't been done before. Computers are faster than they've ever been, and while my personal opinion is that operational knowledge is dying (due to a general disdain for the people who actually used to run systems at scale, rather than just writing hopeful "eventually consistent" YAML that they call deterministic), the systems that exist today hold your hand much better than they did for me 20 years ago.
And I ran 1% of web traffic with an ops team of 5 back then. So, idk what's going on here.
EDIT: Likely people are flagging me because I sound arrogant (or I hurt their feelings by talking bad about YAML-ops), but all I am doing is answering the question presented based on my experience.
I think you meant "green fields" and not "clean room"? Clean room refers to reverse engineering an existing program to produce specifications, then having a separate team implement those specifications so there's no legal risk from exposure to the original.
It really, really depends on what you mean. Specifically, it depends on the application and its various compute, I/O and access patterns. Scaling ecommerce and games is well-known by now (e.g. Amazon and Blizzard have been dealing with insane scale for two decades now.) However, anything outside a well-known pattern can be very tricky to scale.
I once worked on a team that had to scale a system 100x whose downstream dependencies were various 3rd-party APIs and data sources, most of which had no real SLAs to speak of and extremely high variance in latency and data transfer patterns. This basically required rearchitecting everything, including our clients, because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call-tree and caused outages in services 3+ hops removed from ours. In some cases, the rearchitecting went all the way to the UI.
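The general shape of that fix is what's now called a circuit breaker: stop letting a slow dependency hold a synchronous request hostage. A rough sketch of the idea (not our actual code; all thresholds hypothetical):

    # Sketch: wrap each flaky third-party call in a circuit breaker so
    # a misbehaving dependency fails fast instead of rippling upstream.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_after=30.0):
            self.failure_threshold = failure_threshold
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn, *args, fallback=None):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return fallback  # fail fast, don't touch the dependency
                self.opened_at = None  # half-open: allow one trial call
            try:
                result = fn(*args)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                return fallback
            self.failures = 0  # success closes the circuit again
            return result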
Years later, I led a company-wide effort to optimize our entire user-facing application infrastructure so it wouldn't fall over under sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was not a single common recommendation (like "tune your caches") we could give that would help all the teams, because each one had very different resource usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into, some common issues to look for, and some common solutions; get back to us if you need help." We spent a lot of time on the help.
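To give one example of the kind of "common solution" in that handout (a generic sketch, not our actual guidance): shed load once in-flight requests exceed capacity, so a spike degrades into fast 503s rather than toppling the service.

    # Sketch: bound in-flight requests; callers that fail to acquire
    # should return 503 immediately instead of queueing and timing out.
    import threading

    class LoadShedder:
        def __init__(self, max_in_flight=100):  # hypothetical limit
            self.max_in_flight = max_in_flight
            self.in_flight = 0
            self.lock = threading.Lock()

        def try_acquire(self) -> bool:
            with self.lock:
                if self.in_flight >= self.max_in_flight:
                    return False  # shed: caller returns 503
                self.in_flight += 1
                return True

        def release(self):
            with self.lock:
                self.in_flight -= 1

Of course, where a team's bottleneck was elsewhere (a saturated connection pool, a cold cache), this particular fix bought nothing, which is exactly why no single recommendation worked.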
I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB, and it has a dependency (Actions) with extremely high variance in latency and resource usage.