jpollock, yesterday at 9:40 PM

Measurement and alerting are usually done on business metrics, not on the causes. That way you catch whole classes of problems instead of one failure mode at a time.

Not sure about expected loss, that's a decay rate?

But stuck jobs are caught via the number of tasks being processed and the average latency.
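To make that concrete, here is a rough sketch of what alerting on those two metrics could look like. The names (`WindowStats`, `check_window`) and the thresholds are made up for illustration and not taken from any particular monitoring system. The point is only that a stuck job shows up as throughput dropping or average latency climbing, whatever the underlying cause.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical per-window counters, e.g. collected once a minute."""
    tasks_completed: int     # tasks that finished during the window
    total_latency_s: float   # sum of per-task latencies, in seconds


def check_window(stats, min_throughput, max_avg_latency_s):
    """Return the names of alerts that fire for this window.

    Both thresholds are assumptions; real values depend on the workload.
    """
    alerts = []
    # Too few tasks finishing suggests something is stuck or starved.
    if stats.tasks_completed < min_throughput:
        alerts.append("throughput_low")
    # Average latency is only defined when at least one task completed.
    if stats.tasks_completed > 0:
        avg_latency_s = stats.total_latency_s / stats.tasks_completed
        if avg_latency_s > max_avg_latency_s:
            alerts.append("latency_high")
    return alerts
```

A window with zero completed tasks fires `throughput_low` on its own, which is the case a fully stuck queue would produce.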