Hacker News

tialaramex, today at 9:16 AM

The monitoring is the wrong way up, which is the case almost everywhere I've ever worked.

You want an upside-down pyramid in which every checked subsystem contributes either an OK or some failure, and a failure of the checks themselves counts as the most serious failure. The output from the bottom of your pyramid is then, in theory, a single green OK. In practice, something has always failed or is operating in a degraded state.

In this design the alternatives are: 1. Monitor says the Geese are Transmogrified correctly, or 2. Monitoring detected a Goose Transmogrifier problem, or 3. Goose Transmogrifier Monitor failed. The absence of any overall result is a sign that the bottom of the pyramid itself failed: that is a major disaster, and we need to get monitoring working urgently.
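To make the three outcomes concrete, here is a minimal Python sketch. Every name in it (Status, run_check, summarise) is made up for illustration and does not come from any real monitoring tool. It assumes each check is a plain callable returning True or False, and that the summarising layer knows which checks it expects to hear from. A check that crashes, returns nothing useful, or never reports is treated as the most serious outcome, not as silence.

```python
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    # Values are ordered by severity so the worst one can be picked with max().
    OK = 0              # 1. the monitor ran and the subsystem is fine
    PROBLEM = 1         # 2. the monitor ran and found a fault
    MONITOR_FAILED = 2  # 3. the monitor itself produced no usable result


def run_check(check: Callable[[], object]) -> Status:
    """Run one check and always return one of the three outcomes."""
    try:
        result = check()
    except Exception:
        # The check crashed: we know nothing about the subsystem.
        return Status.MONITOR_FAILED
    if result is True:
        return Status.OK
    if result is False:
        return Status.PROBLEM
    # None or any other value is not an answer, so do not read it as healthy.
    return Status.MONITOR_FAILED


def summarise(expected: List[str], results: Dict[str, Status]) -> Status:
    """Combine results from one layer; a missing result is a monitor failure."""
    if not expected:
        # Nothing was checked, so there is no evidence for a green OK.
        return Status.MONITOR_FAILED
    statuses = [results.get(name, Status.MONITOR_FAILED) for name in expected]
    return max(statuses, key=lambda s: s.value)


if __name__ == "__main__":
    # Hypothetical subsystem names, echoing the example above.
    expected = ["goose_transmogrifier", "goose_feeder"]
    results = {"goose_transmogrifier": run_check(lambda: True)}
    # "goose_feeder" never reported, so the summary is not a green OK.
    print(summarise(expected, results).name)
```

Because summarise returns a Status itself, its output can be fed into the next layer as just another result, so a silent summarisation layer shows up as MONITOR_FAILED one level further on.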

What I tend to see instead is a pyramid where alternatives 1 and 2 work but 3 is silent. The summarisation layer can fail silently too, and so can each subsequent layer. In such a system you always have an unknown number of silently failed systems. You are flying blind.


Replies

xorcist, today at 10:59 AM

Closely related to the ever more popular "We don't need monitoring, we have metrics."