qazwsxedchac • yesterday at 9:43 PM • 7 replies

So a single configuration mistake in a single place wiped out external reachability of a major economy. It happened in the evening local time and should be fixable, modulo cache TTLs, by morning. This will limit the blast radius somewhat.

Still, at this level, brittle infrastructure is a political risk. The internet's famous "routing around damage" isn't quite working here. Should make for an interesting post mortem.

Replies

belorn • yesterday at 11:17 PM

I am reminded of the warning that Zonemaster gives about putting all of your domain's name servers in a single AS, as is common practice for many larger providers. A lot of people do not want others to see this as a problem, since a single AS is a convenient configuration for routing, but it has the downside of being a single point of failure.

Building redundant infrastructure that can withstand BGP and DNS configuration mistakes is not that simple, but it can be done.
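
To make the single-AS concern concrete, here is a minimal sketch of that kind of check. It is not Zonemaster's actual implementation. It assumes you have already resolved each name server to the AS number announcing its address, and the host names, AS numbers (taken from the documentation range), and function names are all hypothetical.

```python
from collections import defaultdict


def group_by_asn(ns_to_asn):
    """Group name server host names by the AS number announcing their address."""
    groups = defaultdict(list)
    for name_server, asn in ns_to_asn.items():
        groups[asn].append(name_server)
    return dict(groups)


def single_as_warning(ns_to_asn):
    """Return a warning string if every name server sits in one AS, else None."""
    groups = group_by_asn(ns_to_asn)
    if len(groups) < 2:
        return "WARNING: all name servers are announced from a single AS"
    return None


# Hypothetical input: both servers are in the same (documentation-range) AS.
same_as = {"ns1.example.net": 64496, "ns2.example.net": 64496}

# Hypothetical input: the servers are spread across two ASes.
two_ases = {"ns1.example.net": 64496, "ns2.example.org": 64497}

print(single_as_warning(same_as))
print(single_as_warning(two_ases))
```

Passing this check only says the servers are spread across ASes. It says nothing about whether a bad change could still be pushed to all of them at once.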

pocksuppet • yesterday at 10:20 PM

DNS is a centralization risk, yes. Somehow we've decided this is fine. DNSSEC isn't the only issue - your TLD's nameservers could also be offline, or censored in your country.

gerdesj • today at 12:21 AM

"The internet's famous "routing around damage" isn't quite working here."

DNS is a look up service that runs on the internet.

Internet routing of IP packets is what the internet does and that is working fine (for a given value of fine).

You remind me of someone who says "the internet is down" when they really mean "I've forgotten my wifi password".

Muromec • yesterday at 11:11 PM

>So a single configuration mistake in a single place wiped out external reachability of a major economy.

And fuck nothing at all happened as a result.

lschueller • yesterday at 10:28 PM

I have a bad feeling that the impact will be quite severe for some services, as monitoring, performance, and security services might get disrupted, and just cleaning up is a big mess. Worst case, some OT will experience outages and/or damage. But maybe I am just overestimating the severity of this.

walrus01 • yesterday at 9:53 PM

It looks like a failed key replacement during a scheduled maintenance event. Normally this sort of thing is thoroughly tested and has multiple eyes on for detailed review and planning before changes get committed, but obviously something got missed.

the8472 • yesterday at 10:16 PM

Fail-closed protocols have introduced some brittleness. An HTTP/1.0 server from 1999 can probably still serve visitors today. An HTTPS/TLS 1.0 server from the same year couldn't.
