
baalimago · yesterday at 6:28 PM

My solution to this is to have leveled alerts. Some are... recommendations: the ones you look at with a glance to get a heads-up that something is wrong. These are most likely the ones OP would claim cause alert fatigue.

Then I have a second level on top of this: the superpanic. This is the "true" alert, which means "drop everything, fix this now". Every superpanic triggers stricter routines which intentionally cause friction, such as creating a ticket about said superpanic, potentially hosting a post-mortem, etc. This additional manual labour encourages tweaking the superpanic thresholds so that they are sometimes more lax, sometimes stricter, depending on the quality of the deployed services + the current load.
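
Roughly, the routing looks like this. A minimal Python sketch; the dashboard/pager/ticket hooks are stubs standing in for whatever tooling you actually use:

    from enum import Enum

    def notify_dashboard(msg):
        print("[dashboard] " + msg)   # stub: replace with your dashboard feed

    def page_oncall(msg):
        print("[PAGE] " + msg)        # stub: replace with your pager

    def create_ticket(msg):
        print("[ticket] " + msg)      # stub: replace with your tracker's API

    class Level(Enum):
        RECOMMENDATION = 1            # glance at it, heads-up only
        SUPERPANIC = 2                # drop everything, fix now

    def fire(level, message):
        if level is Level.SUPERPANIC:
            page_oncall(message)
            create_ticket(message)    # the intentional friction: every
                                      # superpanic leaves a paper trail
        else:
            notify_dashboard(message)

    fire(Level.SUPERPANIC, "primary domain not serving traffic")

Keeping the ticket creation inside the superpanic path is the point: the friction only applies at the level where it's supposed to hurt.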

What signals a superpanic? Key functionality being offline. Mostly off-site uptime-checkers verifying that all primary domains resolve + serve traffic, plus cron-scheduled integration tests of core functionality. Stuff like that.
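
A minimal sketch of such an off-site checker, in stdlib-only Python you'd run from cron on a box outside your own infra. The domain list and the page() hook are placeholders:

    import socket
    import urllib.error
    import urllib.request

    DOMAINS = ["example.com", "api.example.com"]   # placeholder domains

    def page(reason):
        print("SUPERPANIC: " + reason)             # stub: hook up your pager

    def check(domain):
        try:
            socket.getaddrinfo(domain, 443)        # does the name resolve at all?
        except socket.gaierror as e:
            page("%s does not resolve: %s" % (domain, e))
            return
        try:
            with urllib.request.urlopen("https://%s/" % domain, timeout=10):
                pass                               # 2xx/3xx: serving traffic, fine
        except urllib.error.HTTPError as e:
            if e.code >= 500:                      # 5xx = down; 4xx may be expected
                page("%s returned HTTP %d" % (domain, e.code))
        except OSError as e:                       # timeout, refused, TLS failure
            page("%s not serving traffic: %s" % (domain, e))

    for d in DOMAINS:
        check(d)

Staying quiet on 4xx is deliberate: a 404 on / might be expected, while a 5xx or a connection failure means the service itself is down.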


Replies

mystifyingpoi · yesterday at 7:09 PM

> there are stricter routines which intentionally cause friction, such as creating tickets

While this sounds sensible, in my experience it often becomes just a convoluted punishment for the people involved when the alert fires. In general, people are lazy (sorry), and if an alert makes them fill out post-mortem forms and attend mandatory late meetings with management to explain why something triggered, 99% of people will push to remove the alert altogether, or at least lower its priority. I haven't found a solution that doesn't involve a complete overhaul of the organization in the enterprise.