Back in the day (10-12 years ago) at a telecom/cable company we accomplished this with F5 BIG-IP GSLB DNS (later migrated to A10's equivalent GSLB devices) as the authoritative DNS server for services/zones that required, or were suitable for, HA. (I can't totally remember, but I'm guessing we must have had a pretty low TTL for this.)
Had no idea that Route 53 had this sort of functionality
Is there not an inherent risk using an AWS service (Route 53) to do the health check? Wouldn’t it make more sense to use a different cloud provider for redundancy?
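For anyone who hasn't seen it, the Route 53 functionality in question is health-check-driven DNS failover. Here's a minimal sketch of that setup with boto3 (the zone ID, hostnames, and IPs are placeholders, not anything from the article):

```python
# Rough sketch of Route 53 DNS failover: a health check on the primary
# endpoint, plus PRIMARY/SECONDARY failover records. All identifiers
# below are made up for illustration.
import boto3

route53 = boto3.client("route53")

# 1. Health check that probes the primary endpoint over HTTPS.
hc = route53.create_health_check(
    CallerReference="primary-endpoint-check-001",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/healthz",
        "RequestInterval": 10,   # seconds between checks (10 or 30)
        "FailureThreshold": 3,   # consecutive failures before "unhealthy"
    },
)

# 2. Failover record pair: Route 53 starts answering with the SECONDARY
#    record once the health check reports the primary as unhealthy.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,  # low TTL so clients re-resolve quickly
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],
                },
            },
        ]
    },
)
```

As the GSLB comment above points out, the TTL matters a lot here: clients only pick up the failover answer as fast as the TTL lets them.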
Hey, I wrote that article!
I'll try to add comments and answer questions where I can.
- Warren
Interesting how engineers like to nerd out about SLAs, but never claim or issue credits when something does occur.
> During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there
> [Our service can only go down] five minutes and 15 seconds per year.
I don't have much experience in this area, so please correct me if I'm mistaken:
Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200ms?
(For what it's worth, for some of my services, 200ms is certainly an impact; not as bad as a 2-second outage, but still noticeable and reportable.)
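Quick napkin math on that budget: the quoted "5 minutes and 15 seconds per year" is exactly what a five-nines (99.999%) target works out to. Whether sub-second blips actually count against it is the open question, but if they do:

```python
# Annual downtime budget implied by a five-nines availability target,
# and how many 200 ms blips would fit in it if they all count.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60          # 31,536,000 s (non-leap year)
availability = 0.99999                          # "five nines"

budget_s = SECONDS_PER_YEAR * (1 - availability)
print(f"annual downtime budget: {budget_s:.2f} s "
      f"({budget_s // 60:.0f} min {budget_s % 60:.2f} s)")
# -> annual downtime budget: 315.36 s (5 min 15.36 s)

blip_s = 0.200
print(f"200 ms blips allowed per year: {budget_s / blip_s:.0f}")
# -> 200 ms blips allowed per year: 1577
```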
This is a rare case where the original bait-y title is probably better than the de-bait-ified title, because the actual article is much less of a brag and much more of an actual case study.
This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.
I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!