
wparad · yesterday at 9:45 PM

I think there are two schools of thought on the "AWS is totally dead everywhere" scenario:

* It is never going to happen, due to the way AWS is designed (or at least the way it's described to us, which explains why it is so hard to execute actions across regions).
* It will happen, but then everything else is going to be dead too, so what's the point?

One problem we've run into, the "DNS is a single point of failure" one, is that there isn't a clear best strategy for "fail over to a different cloud at the DNS routing level."

I'm not the foremost expert when it comes to ASNs and BGP, but from my understanding that would require some multi-cloud collaboration to keep multiple CDNs resolving, something that feels like it would need both multiple layers of physical infrastructure and significant cost to implement correctly, compared to the ROI for our customers.

There's a corollary here for me, which is: stay as simple as possible while still achieving the result. Maybe there is a multi-cloud strategy, but the strategies I've seen still rely on having the DNS zone in one provider that fails over or round-robins to specific infra in specific locations.
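For concreteness, here's roughly what that single-provider failover pattern looks like as a sketch with boto3 against Route 53; the zone ID, health check ID, and addresses below are placeholders, not our actual setup:

    # Sketch: a Route 53 failover pair, primary in AWS, secondary at another cloud.
    # Zone ID, health check ID, and IPs are placeholders.
    import boto3

    route53 = boto3.client("route53")

    def upsert_failover(name, set_id, role, value, health_check_id=None):
        record = {
            "Name": name,
            "Type": "A",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": value}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId="Z_EXAMPLE_ZONE",
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    # Primary points at AWS and is gated on a health check; secondary points at the other cloud.
    upsert_failover("api.example.com.", "primary", "PRIMARY", "192.0.2.10", "hc-primary-id")
    upsert_failover("api.example.com.", "secondary", "SECONDARY", "198.51.100.10")

The catch is exactly the one above: the zone itself still lives in one provider, so whoever hosts the zone remains the single point of failure.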

Third-party health checks have less of a "tainting" problem and more of a complications problem: the more complexity you add to resolving your real state, the harder it is to get it right.

For instance, one thing we keep going back and forth on is "After the incident is over, is there a way for us to stay failed over and not automatically fail back?"

And the answer for us so far is "not really". There are a lot of bad options, all of which could have catastrophic impacts if we don't get them exactly right, and none of which have come with significant benefits yet. But I like to think I have an open mind here.
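One crude option in that direction (sketch only, reusing the hypothetical upsert_failover() from the earlier snippet) is to "pin" traffic on the secondary by pointing the primary failover record at the secondary's address, so a recovering health check has nothing to move traffic back to until a human reverts it:

    # Sketch only: keep traffic on the secondary after an incident by pointing the
    # PRIMARY failover record at the secondary's address. Recovery of the primary
    # health check then changes nothing; an operator runs unpin() later.
    # IDs and addresses are placeholders; upsert_failover() is defined above.

    PRIMARY_IP = "192.0.2.10"
    SECONDARY_IP = "198.51.100.10"

    def pin_to_secondary():
        # Both records now resolve to the secondary.
        upsert_failover("api.example.com.", "primary", "PRIMARY", SECONDARY_IP, "hc-primary-id")

    def unpin():
        # Deliberate, human-initiated fail-back.
        upsert_failover("api.example.com.", "primary", "PRIMARY", PRIMARY_IP, "hc-primary-id")

It "works", but it's exactly the kind of option described above: the DNS state no longer matches whatever infra-as-code produced it, and someone has to remember to unpin.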


Replies

parliament32 · today at 12:39 AM

There are good options if you're willing to pay for them, but they have nothing to do with DNS. You will never get DNS TTLs low enough (and respected widely enough) to prevent a multi-minute service interruption in cases like these.
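If you want to see the caching effect for yourself, a quick spot check with dnspython (the domain and resolver IPs are placeholders) shows what a given recursive resolver is still serving; it won't catch the long tail of resolvers that ignore TTLs entirely, which is the real problem:

    # Quick spot check with dnspython (>= 2.0): how many seconds of cached answer
    # does a given recursive resolver still have for your record?
    import dns.resolver

    def remaining_ttl(name, resolver_ip):
        r = dns.resolver.Resolver()
        r.nameservers = [resolver_ip]
        answer = r.resolve(name, "A")
        return answer.rrset.ttl  # seconds left in that resolver's cache

    # Compare against the TTL you actually publish on the record.
    print(remaining_ttl("api.example.com", "8.8.8.8"))
    print(remaining_ttl("api.example.com", "1.1.1.1"))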

Proper HA is owning your own IP space and anycast-advertising it from multiple IXes/colos/clouds to multiple upstreams / backbone networks. BGP hold times act like a dead man's switch and will ensure traffic stops being routed in that direction within a few seconds of a total outage, plus your own health automation should disable those advertisements when certain things happen. Of course, you need to deal with the engineering complexity of your traffic coming in to multiple POPs at once, and it won't be cheap at all (to start, you're looking at ~10k USD capex for a /24 of IP space, plus whatever the upstreams charge you monthly), but it will be very resilient to pretty much any single point of failure, including AWS disappearing entirely.
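The "health automation should disable those advertisements" piece is commonly wired up with something like an ExaBGP helper process: announce the prefix while a local health check passes, withdraw it when it fails, and let BGP reconvergence drain the site. A rough sketch, assuming ExaBGP is configured to run this script as a process; the prefix, next-hop, and health URL are placeholders:

    #!/usr/bin/env python3
    # Sketch of an ExaBGP "process" health-check script: announce an anycast prefix
    # while the local service is healthy, withdraw it when it isn't. ExaBGP reads
    # the announce/withdraw commands from this process's stdout.
    import sys
    import time
    import urllib.request

    PREFIX = "192.0.2.0/24"          # placeholder anycast prefix
    NEXT_HOP = "self"
    HEALTH_URL = "http://127.0.0.1:8080/healthz"  # placeholder local health endpoint

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except Exception:
            return False

    announced = False
    while True:
        ok = healthy()
        if ok and not announced:
            sys.stdout.write(f"announce route {PREFIX} next-hop {NEXT_HOP}\n")
            sys.stdout.flush()
            announced = True
        elif not ok and announced:
            sys.stdout.write(f"withdraw route {PREFIX} next-hop {NEXT_HOP}\n")
            sys.stdout.flush()
            announced = False
        time.sleep(5)

That only handles draining one POP; the hard part is still the traffic engineering and state-sharing across the POPs that remain.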

toast0 · yesterday at 10:43 PM

It's painful, but you can split your DNS across multiple providers. It's not usually done outside of migrations, but if you use two NS names from providerA and two from providerB, you'll get a mix of resolution (most high-profile domains have 4 NS names; sometimes based on research/testing, sometimes based on cargo culting; I assume you want to fit in... but amazon.com has 8, and the DNS root and some high-profile TLDs have 13, so you do you :)). If either provider fails and stops responding, most resolvers will use the other provider. If one provider fails and returns bad data (including errors), or simply can no longer be updated [1], the redundancy doesn't really help: you probably went from a full outage that's easy to diagnose to a partial outage that's much harder to diagnose; and if both providers are equally reliable, you've increased your chances of having an outage.
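If you do go down this road, it's worth continuously checking that the delegation looks the way you think it does and that both providers are serving the same zone. A rough sketch with dnspython (the domain and the provider nameserver suffixes are placeholders):

    # Sketch: sanity-check that a zone's NS set actually spans two DNS providers
    # and that every nameserver is serving the zone you last pushed.
    import dns.resolver

    DOMAIN = "example.com"
    PROVIDERS = {"providerA": "awsdns", "providerB": "nsone.net"}  # placeholder suffixes

    ns_names = [str(r.target).rstrip(".") for r in dns.resolver.resolve(DOMAIN, "NS")]
    print("NS set:", ns_names)

    # Make sure both providers are actually in the delegation.
    for provider, marker in PROVIDERS.items():
        count = sum(1 for ns in ns_names if marker in ns)
        print(f"{provider}: {count} nameserver(s)")

    # Query each nameserver directly; "responding" and "responding with the data
    # you last pushed" are different failure modes, so compare SOA serials too.
    for ns in ns_names:
        addr = str(dns.resolver.resolve(ns, "A")[0])
        r = dns.resolver.Resolver()
        r.nameservers = [addr]
        soa = r.resolve(DOMAIN, "SOA")[0]
        print(ns, "serial", soa.serial)

Diverging serials are a cheap tell for the "one provider can no longer be updated" case, though they do nothing for resolvers that already cached the stale answer.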

[1] But, it's DNS; the expectation is that some resolvers, hopefully very few of them, will cache data as if your TTL value were measured in days. IMHO, if you want to move all your traffic in a defined timeframe, DNS is not sufficient.