Have you considered the scenario where "everything is so dead in AWS" that the check doesn't happen, and the backends are dead too (assuming the backend services live in AWS as well)? But I'd guess in that case you'd know quickly enough from supplementary alerting (you guys don't seem the type to skip having some sort of awesome monitoring in place), and you'd have a different, worse DR problem on your hands.
As for the OP's point, though, I'd assume the health checks need to stay within (and originate from) AWS, because third-party health checks could dilute the point of the in-house AWS health check service to begin with.
I think there are two schools of thought on "AWS is totally dead everywhere":

* It is never going to happen, because of the way AWS is designed (or at least the way it's described to us, which explains why it is so hard to execute actions across regions).

* It will happen, but then everything else will be dead too, so what's the point?
One problem we've run into, the "DNS is a single point of failure" one, is that there isn't a clear best strategy for "fail over to a different cloud at the DNS routing level."
I'm not the foremost expert on ASNs and BGP, but from my understanding that would require multi-cloud collaboration to keep multiple CDNs resolving, something that feels like it would need multiple layers of physical infrastructure as well as significant cost to implement correctly, relative to the ROI for our customers.
There's a corollary here for me, which is: stay as simple as possible while still achieving the result. Maybe there is a multi-cloud strategy, but the strategies I've seen still rely on having the DNS zone in one provider that fails over or round-robins to specific infra in specific locations.
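To make that concrete, the single-provider pattern I mean is basically a pair of failover records in one zone. Here's a rough boto3 sketch (the zone ID, health check ID, and IPs are made up, and this isn't our actual setup): the secondary answer can point at another cloud, but the zone itself, and therefore the routing decision, still lives with one provider.

    import boto3

    route53 = boto3.client("route53")

    # Placeholder zone ID, health check ID, and IPs, not real values.
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE",
        ChangeBatch={"Changes": [
            {   # Primary answer: returned only while its health check passes.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com", "Type": "A",
                    "SetIdentifier": "primary-aws", "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                    "HealthCheckId": "hc-primary-placeholder",
                },
            },
            {   # Secondary answer: points at infra in another cloud, but the
                # record still lives in this one zone in this one provider.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com", "Type": "A",
                    "SetIdentifier": "secondary-other-cloud", "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],
                },
            },
        ]},
    )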
Third-party health checks are less a "tainting" problem and more a source of further complication: the more complexity you add to resolving your real state, the harder it is to get right.
For instance, one thing we keep going back and forth on is "After the incident is over, is there a way for us to stay failed over and not automatically fail back?"
And the answer for us so far is "not really". There are a lot of bad options, all of which could have catastrophic impact if we don't get them exactly right, and none of which has come with significant benefit yet. But I like to think I have an open mind here.
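To give a sense of the shape those options take: the simplest version is a manual override in front of whatever endpoint the DNS health check probes, so an operator can keep the primary reporting unhealthy until someone explicitly clears it. This is just a minimal sketch (the flag path and port are invented, and it isn't something we run); getting this kind of extra state wrong is exactly the catastrophic case I mean.

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical flag an operator creates to stay failed over after an incident.
    OVERRIDE_FLAG = "/var/run/force-failover"

    def backends_ok():
        return True  # placeholder for the real dependency checks

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Report unhealthy while the manual override exists, even if the
            # real backends have recovered, so DNS does not fail back on its own.
            if os.path.exists(OVERRIDE_FLAG):
                self.send_response(503)
            elif backends_ok():
                self.send_response(200)
            else:
                self.send_response(503)
            self.end_headers()

    HTTPServer(("", 8080), HealthHandler).serve_forever()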