
solatic, yesterday at 6:47 PM

Not all alerts are created equal. You should generally have three levels of alerts: critical (pages somebody; time-to-fix should be ASAP), warning (creates a ticket; time-to-fix should be within a few days), and suspicious (does not notify; appears only on an alert dashboard). The suspicious alerts are there to help guide your investigation of a critical or warning alert.
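To make the three levels concrete, here is a minimal Python sketch of severity-based routing. The action names ("page", "ticket", "dashboard_only") and the `Alert` type are placeholders I made up for illustration, not any particular tool's API; you would map them onto whatever pager and ticketing integrations you actually run.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # pages somebody, fix ASAP
    WARNING = "warning"        # creates a ticket, fix within a few days
    SUSPICIOUS = "suspicious"  # no notification, dashboard only


# Hypothetical action names; replace with your real integrations.
ROUTING = {
    Severity.CRITICAL: "page",
    Severity.WARNING: "ticket",
    Severity.SUSPICIOUS: "dashboard_only",
}


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity


def route(alert: Alert) -> str:
    """Return the action for an alert based solely on its severity."""
    return ROUTING[alert.severity]
```

The point of keeping the mapping in one table is that nobody gets to invent a fourth, ad hoc notification path for their favorite alert.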

Each critical and warning alert should link to an "interactive runbook" - a dashboard that combines text instructions along with graphs showing real-time data.

Doing this at scale, correctly, requires both alerts-as-code and dashboards-as-code, which almost nobody does because nobody treats higher-level configuration languages (jsonnet, CUE...) with the attention and respect they deserve /cries-in-yaml
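As a rough illustration of what "as-code" buys you, here is a Python sketch in which a single alert definition generates both the alert rule and its runbook dashboard, so the two cannot drift apart. Everything here is an assumption for illustration: the `AlertDef` fields, the output dict shapes, and the `dashboards.example.invalid` URL are invented and do not match any specific monitoring system's schema. In practice you would write this in jsonnet or CUE and emit your tool's real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDef:
    name: str
    severity: str      # "critical", "warning", or "suspicious"
    expr: str          # query that fires the alert and feeds the graph
    instructions: str  # text shown at the top of the runbook


def runbook_url(alert: AlertDef) -> str:
    # Placeholder host; substitute your dashboard server.
    return f"https://dashboards.example.invalid/runbooks/{alert.name}"


def render_rule(alert: AlertDef) -> dict:
    """Build a generic alert rule; notifying alerts get a runbook link."""
    rule = {
        "alert": alert.name,
        "expr": alert.expr,
        "labels": {"severity": alert.severity},
    }
    if alert.severity in ("critical", "warning"):
        rule["annotations"] = {"runbook_url": runbook_url(alert)}
    return rule


def render_runbook(alert: AlertDef) -> dict:
    """Build a generic dashboard: text instructions plus a live graph."""
    return {
        "title": f"Runbook: {alert.name}",
        "panels": [
            {"type": "text", "content": alert.instructions},
            {"type": "graph", "query": alert.expr},
        ],
    }
```

Because the graph panel reuses the alert's own expression, whoever gets paged sees exactly the data that triggered the page.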