Measurement and alerting are usually done on business metrics, not on the causes. That way you catch whole classes of problems instead of one cause at a time.
Not sure about expected loss. Is that a decay rate?
But stuck jobs show up via the number of tasks being processed and the average latency.
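A rough sketch of what alerting on those two symptoms could look like, in Python. The `WindowStats` shape, the `check_alerts` name, and both thresholds are made up for illustration; real values depend on the system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Aggregated metrics for one monitoring window (hypothetical shape)."""
    tasks_processed: int   # tasks completed during the window
    avg_latency_s: float   # mean task latency in seconds


def check_alerts(stats: WindowStats,
                 min_tasks: int = 100,
                 max_avg_latency_s: float = 5.0) -> List[str]:
    """Return alerts based on symptoms only; thresholds are illustrative."""
    alerts = []
    # Throughput drops when jobs are stuck, whatever the underlying cause.
    if stats.tasks_processed < min_tasks:
        alerts.append("low throughput")
    # Average latency climbs when work queues up behind stuck jobs.
    if stats.avg_latency_s > max_avg_latency_s:
        alerts.append("high latency")
    return alerts
```

Neither check knows why jobs are stuck (dead worker, lock, bad deploy), which is the point: one symptom alert covers all of those causes.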