Hacker News

echelon today at 12:41 PM

In a high-performance service with good maintenance and upkeep, you page for all 500s. A noisy pager forces the team to fix the 500s.

Maybe the GitHub Actions infrastructure isn't run like that.

edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262


Replies

Doohickey-d today at 12:53 PM

I'm curious about this, because in my experience (working on smaller services, though) a small number of errors is always there as a "baseline".

Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"

Which makes me think that a small number of random issues, happening even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.

TheDong today at 12:49 PM

Do you know of a single service at a single company that actually does that?

I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and GitHub all do not page on just a single 500.

I know none of those are particularly "high performance" though. Curious where your experience is coming from.

compumike today at 2:00 PM

Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:

If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!

If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.

Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
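
A minimal sketch of the two extremes above, assuming a failure record that tracks how many checks in a row have failed and for how long. The `Failure` type, the source names, the N=3 / M=5 defaults, and the 09:00-17:00 weekday window are all hypothetical choices for illustration, not anything heyoncall.com is known to use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Failure:
    source: str                # hypothetical label, e.g. "db_health_check"
    consecutive_failures: int  # failed checks in a row for this source
    minutes_failing: float     # how long the failing streak has lasted


def route(failure: Failure, n_checks: int = 3, m_minutes: float = 5) -> str:
    """Return "page" for a sustained failure, "defer" for anything else."""
    sustained = (
        failure.consecutive_failures >= n_checks
        and failure.minutes_failing >= m_minutes
    )
    return "page" if sustained else "defer"


def is_business_hours(now: datetime) -> bool:
    # Assumed window: Monday (0) through Friday (4), 09:00 to 17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def should_notify_now(failure: Failure, now: datetime) -> bool:
    """Pages go out immediately; deferred items wait for business hours."""
    return route(failure) == "page" or is_business_hours(now)
```

With these defaults, a health check that has failed 5 times over 10 minutes notifies at 3am on a Saturday, while a single form-validation 500 waits until the next weekday morning.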

awithrow today at 12:53 PM

that is absolutely not the case for any system of size and scale. that would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.
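
A rough sketch of what alerting on error rates and budgets can look like, assuming a 99.9% availability SLO. The function names and the SLO value are illustrative assumptions, not any particular company's setup.

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def burn_rate(window_requests: int, window_failures: int,
              slo: float = 0.999) -> float:
    """How fast a window spends budget; 1.0 means exactly on budget."""
    if window_requests == 0:
        return 0.0
    return (window_failures / window_requests) / (1 - slo)


def should_page(window_requests: int, window_failures: int,
                slo: float = 0.999, max_burn_rate: float = 10.0) -> bool:
    # max_burn_rate is an assumed threshold; teams tune it per service.
    return burn_rate(window_requests, window_failures, slo) > max_burn_rate
```

Under these assumptions, 250 failures out of 1,000,000 requests leaves roughly three quarters of the budget unspent and pages nobody, whereas a window running at a 5% error rate burns budget about 50 times too fast and does page.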

hvb2 today at 2:02 PM

> A noisy pager forces the team to fix the 500s.

I'm sure you're not in ops. Or in a dev org of a service with decent request rates.

What you're asking for is for a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.

A 50 year old bank API? Maybe...

rhyperior today at 2:06 PM

You only do this when you’re trying to use incident management as a hammer to make a point to somebody whom you have otherwise failed to convince, through persuasive argument, to fix something. I.e., it’s punitive.

swiftcoder today at 1:54 PM

Yeah, no, nobody runs cloud services like that. At AWS most alarms required failures in 3 consecutive 5-minute periods. Critical things could be on 3 consecutive 1-minute windows, but that alarm starts a 15-minute escalation for the oncall engineer to check in, and they have to validate that the issue isn't a false alarm before updating the status page would even be considered.
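
The "N consecutive periods" rule can be sketched roughly as below. This is an illustration of the idea only, not AWS's actual alarm implementation; the class name, the per-period failure threshold, and the way periods are fed in are all assumptions.

```python
from collections import deque


class ConsecutivePeriodAlarm:
    """Fires only when every one of the last N periods breached the threshold."""

    def __init__(self, periods_required: int = 3, failure_threshold: int = 1):
        self.periods_required = periods_required
        self.failure_threshold = failure_threshold
        # Keeps only the most recent N breach flags; older ones fall off.
        self._recent = deque(maxlen=periods_required)

    def record_period(self, failures: int) -> bool:
        """Record one completed period; return True if the alarm is firing."""
        self._recent.append(failures >= self.failure_threshold)
        return len(self._recent) == self.periods_required and all(self._recent)


# Each number is the failure count for one completed period (e.g. 5 minutes).
alarm = ConsecutivePeriodAlarm(periods_required=3)
history = [alarm.record_period(f) for f in [5, 0, 2, 3, 4, 0]]
```

A single clean period resets the streak, so an isolated burst of 500s never fires the alarm on its own; in the sample sequence it fires only after the third breaching period in a row.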

jordemort today at 1:10 PM

forget it, Jake; it’s Azure