oofbey • yesterday at 6:43 PM • 1 reply

At scale the rare events start to happen reliably. Hardware failures almost certainly cause ERROR conditions, and so do network glitches.

Our production system pages on-call for any errors. At night it will only wake somebody up for a whole bunch of errors. This discipline forces us to look at every ERROR and decide whether it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where errors are logged constantly and this strategy doesn't make sense any more. But for now it helps keep our system clean.
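
A rough sketch of that paging rule, in case it helps picture it. The comment gives no actual numbers, so the night window and the burst threshold below are made-up placeholders, and `should_page` is a hypothetical helper, not anything from a real paging tool.

```python
from datetime import datetime, time

# Hypothetical values: the comment does not say what counts as "night"
# or how many errors make "a whole bunch".
NIGHT_START = time(22, 0)
NIGHT_END = time(7, 0)
NIGHT_ERROR_THRESHOLD = 10


def is_night(now: datetime) -> bool:
    """True if `now` falls in the overnight window, which wraps past midnight."""
    t = now.time()
    return t >= NIGHT_START or t < NIGHT_END


def should_page(recent_error_count: int, now: datetime) -> bool:
    """Page on any error during the day, but only on a burst of errors at night."""
    if recent_error_count <= 0:
        return False
    if is_night(now):
        return recent_error_count >= NIGHT_ERROR_THRESHOLD
    return True
```

With these placeholder values, a single error at noon pages, a single error at 3 AM does not, and ten errors at 3 AM do.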

Replies

georgefrowny • today at 12:50 PM

I think if someone is going to be gotten out of bed, that would be a critical rather than an error. Generally I'd say in a large "live" system, errors end up raising Jira tickets; criticals end up ringing phones.
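
As a minimal illustration of that split, using the level constants from Python's standard `logging` module. The `route` function and the action names are hypothetical; actually filing a ticket or ringing a phone would go through whatever integration the system has.

```python
import logging


def route(level: int) -> str:
    """Map a log level to an action: criticals page, errors open a ticket."""
    if level >= logging.CRITICAL:
        return "page"      # someone's phone rings
    if level >= logging.ERROR:
        return "ticket"    # e.g. a Jira ticket for later triage
    return "log_only"      # warnings and below are just recorded
```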
