Hacker News

iso1631, today at 7:44 PM

I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200ms?

(For what it's worth, for some of my services, 200ms is certainly an impact, not as bad as 2 seconds of outage but still noticeable and reportable)


Replies

wparad, today at 8:29 PM

Good catch. The truth is, while we track downtime for incident reporting, it's more correct to track the number of requests that result in a failure. Our SLAs are based on request volume, not specifically on time. Most customers don't have perfect sustained usage. Being down when they aren't running is irrelevant to everyone.
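To make the difference between time-based and request-based measurement concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Minute` bucket, the function names, and the sample numbers are hypothetical, not how Authress actually computes its SLA.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    """Hypothetical one-minute traffic bucket (illustrative only)."""
    total: int   # requests received in this minute
    failed: int  # requests that ended in a failure


def time_based_availability(minutes: list[Minute]) -> float:
    """Share of minutes with zero failed requests; idle minutes count as up."""
    if not minutes:
        return 1.0
    up = sum(1 for m in minutes if m.failed == 0)
    return up / len(minutes)


def request_based_availability(minutes: list[Minute]) -> float:
    """Share of requests that did not fail; idle minutes carry no weight."""
    total = sum(m.total for m in minutes)
    if total == 0:
        return 1.0
    failed = sum(m.failed for m in minutes)
    return 1 - failed / total


# Made-up traffic: one bad minute that happened to carry very little load.
traffic = [Minute(1000, 0), Minute(10, 10), Minute(0, 0), Minute(990, 0)]
print(time_based_availability(traffic))     # one of four minutes was bad
print(request_based_availability(traffic))  # 10 of 2000 requests failed
```

With this made-up traffic the time-based view reports a quarter of the window as down, while the request-based view reflects that only 10 of 2000 requests were affected.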

This is where grey failures come into play. Even when you know you are having an incident, it's really hard, often impossible, to know what the impact is on a customer without them telling you.

To know that you are "down", the edge that receives the HTTP request has to be able to track requests. For us that is CloudFront, but if there is an issue before that (at DNS, at the network level, etc.), we just can't know what the actual impact is.

As far as measuring how you are down: when we can know at all, we can pretty accurately know the list of failures that are happening and what the results are.

That's because most components are behind CloudFront in any case. And if CloudFront isn't having a problem, we'll have telemetry that tells us what the HTTP request/response status codes and connection completions look like. Then it's a matter of measuring from our first detection to the actual remediation being deployed (assuming there is one).
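As a rough sketch of that kind of measurement, the snippet below summarizes an incident from edge telemetry. The `(timestamp, status)` record shape and the `summarize_incident` name are hypothetical and this is not CloudFront's actual log format. It also assumes a "failure" means an HTTP 5xx response, and it uses the span from the first to the last observed failure as a stand-in for "first detection to remediation".

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_incident(records):
    """Summarize failures in a list of (timestamp, http_status) tuples.

    Assumption: any 5xx status counts as a failure. Returns None when
    no failures were observed.
    """
    failures = [(ts, status) for ts, status in records if status >= 500]
    if not failures:
        return None
    first = min(ts for ts, _ in failures)
    last = max(ts for ts, _ in failures)
    return {
        "first_failure": first,
        "last_failure": last,
        "duration": last - first,  # proxy for detection-to-remediation time
        "failed_requests": len(failures),
        "by_status": Counter(status for _, status in failures),
    }


# Made-up telemetry: two failures 200 ms apart between successful requests.
t0 = datetime(2024, 1, 1, 12, 0, 0)
records = [
    (t0, 200),
    (t0 + timedelta(milliseconds=100), 503),
    (t0 + timedelta(milliseconds=300), 502),
    (t0 + timedelta(milliseconds=400), 200),
]
summary = summarize_incident(records)
print(summary["duration"], summary["failed_requests"], dict(summary["by_status"]))
```

Note that this only sees requests that actually reached the edge, so a failure earlier in the path (DNS, network) would not appear in these records at all.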

Another thing that helps here is that we have multiple other products that also use Authress, and we can run technology in other regions that reports this information for those accounts (obviously not for all customers). That can help us identify the impact with additional accuracy, but it is often unnecessary.