Hacker News

kyledrake today at 7:11 AM

Without getting into the specific stuff I've run into: automated stuff just breaks.

This is a living organism with moving parts and a time limit - you update nginx with a change that breaks .well-known by accident, or upgrade to a new version of Ubuntu and suddenly some dependency isn't loading correctly, or that UUID generator you depended on to generate the name for the challenge doesn't get loaded, or certbot becomes obsolete because of some API change and you can't upgrade to the latest because the OS is older and you installed it from the package manager.

You eventually see it in your exception monitoring, or when an SSL monitor detects the cert is about to expire. Then you have to drop that other urgent thing you needed to get done, come in and debug it, fix it, and re-issue all the certs at whatever pace the rate limit allows. That's assuming you have that monitoring - most sites probably don't.
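For sites without that monitoring, the expiry check itself is small. A minimal sketch using only the Python standard library, not any particular monitoring product: the function names and the 15-day `warn_days` threshold are hypothetical, chosen for illustration.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a notAfter timestamp in the format getpeercert() returns."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Connect to host over TLS and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_attention(host, warn_days=15):
    """True if the cert expires within warn_days (hypothetical threshold)."""
    return days_until_expiry(fetch_not_after(host)) < warn_days
```

Something like `needs_attention("example.com")` could run from cron and page someone when it returns True. Note that the default context verifies the certificate, so an already-expired cert raises `ssl.SSLCertVerificationError` during the handshake instead of returning a negative number; a real check would want to treat that as an alert too.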

If you detect that issue with 1/3 of the cert's lifetime left, you will now have 15 days to fix it instead of 30. If you can't finish in time, or you don't learn about it in time, the site(s) hard fail in every web browser that visits, and you've effectively got a full site outage until you repair it.
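The arithmetic behind those numbers, as a small sketch. The 90-day and 45-day lifetimes are an assumption inferred from the 30 and 15 day figures above (each is one third of the lifetime); `days_of_slack` is a hypothetical helper name.

```python
from fractions import Fraction


def days_of_slack(lifetime_days, fraction_left=Fraction(1, 3)):
    """Days remaining when a failure is noticed with fraction_left of the lifetime to go."""
    return lifetime_days * fraction_left


# Assumed lifetimes: 90 days before the change, 45 days after.
old_window = days_of_slack(90)  # 30 days to debug and re-issue
new_window = days_of_slack(45)  # 15 days to debug and re-issue
```

Halving the lifetime halves the time available to notice, debug, and re-issue, unless the alert fires at a larger fraction of the remaining lifetime.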

So you discover it's because certbot doesn't work with a new API change, and you can't upgrade it with the package manager. Now you need to figure out how to compile it from source, but it doesn't like the Python that is currently installed, so now you need to install that from source too, but that version of Python breaks your Python web app, so you have to figure out how to migrate your app to that version of Python before you can do that, and the programmer who can do that is on a week-long whitewater rafting trip in Idaho.

Aside from all that, what happens if a hacker manages to wreck the Let's Encrypt infra so badly they need 2 weeks to get it back online? The Internet Archive was offline for weeks after a DDoS attack. The Cloudflare outage took one site of mine down for less than 10 minutes; it's not hard to imagine a much worse outage for the web here.


Replies

fcatalan today at 7:37 AM

AKA the real world, a place where you have older appliances, legacy servers, contractual constraints and better things to do than watch a nasty yearly ritual become a nasty monthly ritual. I need to make sure SSL is working on a bunch of very heterogeneous stuff, but I'm not in a position to replace it and/or pick an authority with better automation. I just suck it up and dread it when a "cert day" looms closer.

Sometimes these kinds of decisions seem to come from bodies that think the Internet exists solely for doing the thing they do.

Happens to me with the QA people at our org. They behave as if everything happens just so they can measure it, creating a Heisenberg situation where their incessant narrow-minded meddling makes actually doing anything nearly impossible.

crote today at 7:42 AM

The same happens with manual processes done once a year - you just aren't aware of it until renewal.

Consider the inevitable need for immediate renewal due to an incident. Would you rather have this renewal happen via a fast, automated and well-tested process, or a slow, manual and silently broken one?

cpach today at 11:36 AM

“Aside from all that, what happens if a hacker manages to wreck the let’s encrypt infra so badly they need 2 weeks to get it back online?”

There are other CAs that offer certs via ACME. For example, Google Trust Services.
