Hacker News

The Dangers of SSL Certificates

73 points by azhenley yesterday at 10:41 PM | 90 comments

Comments

dextercd today at 12:03 AM

You need external monitoring of certificate validity. Your ACME client might not be sending failure notifications properly (like happened to Bazel here). The client could also think everything is OK because it acquired a new cert, meanwhile the certificate isn't installed properly (e.g., not reloading a service so it keeps using the old cert).

I have a simple Python script that runs every day and checks the certificates of multiple sites.

One time this script signaled that a cert was close to expiring even though I saw a newer cert in my browser. It turned out that I had accidentally launched another reverse proxy instance which was stuck on the old cert. Requests were randomly passed to either instance. The script helped me correct this mistake before it caused issues.
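
A minimal sketch of that kind of daily check (this is not dextercd's actual script; the host list, warning threshold, and exit-code alerting are placeholder assumptions). Because it inspects the certificate each endpoint actually serves, it also catches cases like the stray reverse proxy above:

    # Daily check of the certificates actually being served (sketch).
    # HOSTS and WARN_DAYS are placeholders; hook the non-zero exit code
    # or the printed output into whatever alerting you already have.
    import socket
    import ssl
    import sys
    from datetime import datetime, timezone

    HOSTS = ["example.com", "example.org"]  # sites to check (placeholder)
    WARN_DAYS = 21                          # alert threshold (assumption)

    def days_left(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # notAfter looks like "Jun  1 12:00:00 2026 GMT"
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
        return (expires - datetime.now(timezone.utc)).days

    problems = []
    for host in HOSTS:
        try:
            remaining = days_left(host)
            if remaining < WARN_DAYS:
                problems.append(f"{host}: certificate expires in {remaining} days")
        except OSError as exc:
            # An already-expired or otherwise invalid cert fails the handshake.
            problems.append(f"{host}: TLS check failed ({exc})")
    if problems:
        print("\n".join(problems))
        sys.exit(1)

Run it from cron or a scheduled CI job so it actually executes every day.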

donatj today at 1:01 PM

Once a year for a number of years we would have a small total outage as our Ops team forgot to renew our wildcard certificate. Like clockwork.

It's been a couple of years now so they must have set better reminders for themselves.

I have tried several times to convince them of the joys of ACME, but they're insistent that a Let's Encrypt certificate "looks unprofessional". More professional than a down application in my opinion at least. It's not the early 2000s anymore, no one's looking at your certificate.

jsiepke today at 9:52 AM

I wonder what the point of this blog is. It's kinda easy to rip on certificates without giving at least one possible way of fixing this, even if it's an unrealistic one.

Sure, the low-level nitty gritty of managing keys and certificates for TLS is hard if you don't have the expertise. You don't know about the hundreds of ways you can get bitten. But all the pieces for a better solution are there. Someone just needs to fold it into a neater, higher-level solution. But apparently, by the time someone has gained the expertise to manage this complexity, they also lose interest in making a simple solution (I know I have).

> You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.

Of course you can, if you really want to. You could get different certificates with different expiry times for your reverse (ingress) proxies.

A more straightforward solution is to have monitoring which retrieves the certificate from your HTTPS endpoints and alerts when the expiry time is sooner than it ever should be (i.e. when it should already have been renewed). For example by using Prometheus and ssl_exporter [1].

> and the renewal failures didn’t send notifications for whatever reason.

That's why you need to have dead man's switch [2] type monitoring in your alerting. That's not specific to TLS, BTW. Heck, even your entire Prometheus infra can go down. A service like healthchecks.io [3] can help with "monitoring the monitors".

[1] https://github.com/ribbybibby/ssl_exporter [2] https://en.wikipedia.org/wiki/Dead_man%27s_switch [3] https://healthchecks.io/
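
To make the dead man's switch above concrete, a rough sketch (assumptions: healthchecks.io's hosted ping endpoint at hc-ping.com, and a placeholder check UUID). The monitoring job pings only after a successful run; if the pings stop arriving, healthchecks.io raises the alarm, so a dead cron job or a down Prometheus gets noticed too:

    # Dead man's switch around a scheduled check job (sketch).
    # The UUID in PING_URL is a placeholder for your own healthchecks.io check.
    import urllib.request

    PING_URL = "https://hc-ping.com/00000000-0000-0000-0000-000000000000"

    def run_certificate_checks():
        # ... whatever certificate / endpoint checks run on this schedule ...
        pass

    if __name__ == "__main__":
        run_certificate_checks()  # raises on failure, so no ping is sent
        urllib.request.urlopen(PING_URL, timeout=10)  # "still alive" signal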

dvratil yesterday at 11:59 PM

Happened on the first day of my first on-call rotation - a cert for one of the key services expired. Autorenew failed, because one of the subdomains on the cert no longer resolved.

The main lesson we took from this was: you absolutely need monitoring for cert expiration, with an alert when (valid_to - now) becomes less than the typical refresh window.

It's easy to forget this, especially when it's not strictly part of your app, but essential nonetheless.

aljgz today at 12:52 PM

No criticism of SSL-Certs in particular.

Essentially the flip side of any critical but low-maintenance part of your system: it's so reliable that you can forget to have external monitors, it's reliable enough that it can work for years without any manual labor, and it's so critical that it can break everything.

Competent infra teams are really good at going over these. But once in a while one of them slips through. It's not a failure of the reliable but critical subsystem, it's a failure mode of humans.

One of the main ways described in "How Complex Systems Fail".

tialaramex today at 9:16 AM

The monitoring is the wrong way up, which is the case almost everywhere I've ever worked.

You want an upside down pyramid, in which every checked subsystem contributes an OK or some failure, and failure of these checks is the most serious failure, so the output from the bottom of your pyramid is in theory a single green OK. In practice, systems have always failed or are operating in some degraded state.

In this design the alternatives are: 1. Monitor says the Geese are Transmogrified correctly or 2. Monitoring detected a Goose Transmogrifier problem, or 3. Goose Transmogrifier Monitor failed. The absence of any overall result is a sign that the bottom of the pyramid failed, there is a major disaster, we need to urgently get monitoring working.

What I tend to see instead is a pyramid where alternatives 1 and 2 work but 3 is silent; the summarisation layer can fail silently too, and so can subsequent layers. In this system you always have an unknown number of silently failed systems. You are flying blind.

philippta today at 8:35 AM

When I connect to my server over SSH, I don't have to rotate anything, yet my connection is always secure.

I manually approve the authenticity of the server on the first connection.

From then on, the only time I'd be prompted again would be if either the server changed or there's a risk of MITM.

Why can't we have this for the web?

teunispeters today at 2:25 PM

One of the interesting things in the ISO 15118-2 (and ISO 15118-20) protocols for EV charging is that they include a check for “is your contract certificate expiring soon?”.

So yeah, certificate timelines can be monitored, complete with warnings ahead of time.

Corollary: the service checking the certificates should have a reasonably accurate clock.

OhMeadhbh today at 4:07 PM

Meh. Seems like the author just doesn't want to have to remember to renew his certs. But I guess "standard tooling makes it harder than it should be for people focused on things other than renewing certs to easily figure out what they're supposed to do" is a valid critique. Suggestions for how to make things better would have been nice.

loloquwowndueo yesterday at 11:05 PM

There are plenty of other technologies whose failure mode is a total outage; it’s not exclusive to a failed certificate renewal.

A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.

A broken alerting system is mentioned: “didn’t alert for whatever reason”.

If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.

Sounds like a case of “nothing in this automated process can fail, so we only need this one trivial monitor which also can’t fail so meh” attitude.

firesteelrain today at 12:43 AM

Operationally, the issue is rooted in simple monitoring and accurate inventory. The article is apt: “With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong.”

You can update your cert to prepare for it by appending ---NEW CERT--- to the same file as ---OLD CERT---.

But you also need to know where all your certificates are located. We were using Venafi for the auto discovery and email notifications. Prometheus ssl_exporter with Grafana integration and email alerts works the same. The problem is knowing where all the hosts, containers and systems that have certs are located. A simple nmap-style scan of all endpoints can help. But you might also have containers with certs, or you might have certs baked into VM images. Sure, there are all sorts of things like storing the cert in a CICD global variable, bind mounting secrets, Vault Secret Injector, etc.

But it’s all rooted in maintaining a valid, up-to-date TLS inventory. And that’s hard. As the article states: “There’s no natural signal back to the operators that the SSL certificate is getting close to expiry. To make things worse, there’s no staging of the change that triggers the expiration, because the change is time, and time marches on for everyone. You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.”

Every time this happens you whack-a-mole a change. You get better at it, but not before you lose some credibility.
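
As a toy illustration of the discovery side (hypothetical: the endpoint list stands in for whatever a scan or inventory tool turns up, verification is disabled on purpose so internal/self-signed certs are still captured, and parsing the raw DER needs the third-party cryptography package):

    # Toy TLS inventory sweep: record what certificate each endpoint actually
    # serves and how long it has left. ENDPOINTS is a placeholder list.
    import socket
    import ssl
    from datetime import datetime, timezone
    from cryptography import x509  # getpeercert() returns nothing when
                                   # verification is disabled, so parse DER

    ENDPOINTS = [("internal-app.example", 443), ("db.example", 8443)]

    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # inventory scan, not validation
    ctx.verify_mode = ssl.CERT_NONE  # accept self-signed / internal CAs

    for host, port in ENDPOINTS:
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    der = tls.getpeercert(binary_form=True)
            cert = x509.load_der_x509_certificate(der)
            # not_valid_after is naive UTC (newer cryptography versions also
            # offer not_valid_after_utc)
            expires = cert.not_valid_after.replace(tzinfo=timezone.utc)
            days = (expires - datetime.now(timezone.utc)).days
            print(f"{host}:{port}  {cert.subject.rfc4514_string()}  {days} days left")
        except OSError as exc:
            print(f"{host}:{port}  unreachable ({exc})")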

1970-01-01 today at 1:32 AM

I agree with this. Certs are designed to function as a digital cliff. They will either be accepted or they won't, with no safe middle ground. Therefore all certs in a chain can only be as reliable as the least understood cert in your certificate management.

gmuslera today at 12:26 AM

If you think SSL certificates are dangerous, try seeing the dangers of NOT using them, especially for a service that is a central repository of artifacts meant to be automatically deployed.

It is not about encryption (for that, a self-signed certificate lasting until 2035 would suffice), but verification: who am I talking with? Reaching the right server can be messed up by DNS or routing, among other things. Yes, that adds complexity, but we are talking more about trust than technology.

And once you recognize that it is essential to have a trusted service, give it the proper instrumentation to ensure that it works properly, including monitoring, expiration alerts, and documentation, rather than just saying "it works" and dismissing it.

May we retitle the post as "The dangers of not understanding SSL Certificates"?

flowerlad yesterday at 11:36 PM

We need a way to set multiple SSL certificates with overlapping durations, so if one certificate expires the backup certificate becomes active. If the overlap is a couple of months, you have plenty of time to detect and fix the issue.

Having only one SSL certificate is a single point of failure; we have eliminated single points of failure almost everywhere else.

jeffrallen today at 2:54 PM

The blackbox exporter from Prometheus publishes the "number of seconds until expiration" as part of the metrics of every HTTPS fetch. Set an alert with 30 days warning, and then don't ignore the alerts.

Problem solved.

PS: It would be nice if it could check whois for the expiration of your domain too, but I haven't seen that yet.

0x073 today at 12:00 AM

And it gets worse: the maximum certificate lifetime is being reduced to 47 days by 2029.

nrhrjrjrjtntbt today at 5:13 AM

As always, you need a test that runs and notifies SRE or oncall. Ideally 14 or maybe 28 days before expiry.

JackSlateur today at 11:46 AM

But certificates work as intended

Of course, if your certificate is expired, then "the failure mode is the opposite of graceful degradation"

Just like when your password is wrong: you cannot login, the failure mode is the opposite of graceful degradation

whirlwin today at 7:36 AM

TLS certificates are not the only technology for which the default mode is failure. What about disks, databases or syntax errors in configuration files in general?

In technology, there are known problems and unknown problems. Expiring TLS certificates is a known problem which has an established solution.

Imagine if only some of the requests failed because a certificate is about to expire. That would be a debugging nightmare.

0xbadcafebee today at 7:34 AM

> With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong. And things don’t go wrong that often with certificates

Don't worry. With 2 or 3 industry players dictating how all TLS certs work, now your certs will expire in weeks rather than years, so you will all be subject to these failures more frequently. But as a back-stop to process failures like this, use automated immutable runbooks in CI/CD. It works like this:

1) Does it need a runbook? Ask yourself, if everything was deleted tomorrow, do you (and all the other people) remember every step needed to get everything running again? If not, it needs a runbook.

2) What's a runbook? It's a document that gives step by step instructions to do a thing. The steps can be text, video recordings, code/shell snippets, etc as long as it does not assume anything and gives all necessary instructions (or links to them) so a random braindead engineer at 3am can just do what it says and it'll result in a working thing.

3) Automate the runbook over time. Put more and more of the steps into some kind of script the user can just run. Put the script into a Docker container so that everyone's laptop environment doesn't have to be identical for the steps to work.

4) Run the containerized script from CI/CD. This ensures all credentials, environment vars, networking, etc are the same when it runs which better ensures success, and that leads to:

5) Running it frequently/on a schedule. Most CI/CD systems support scheduled jobs. Run your runbooks frequently to identify unexpected failures and fix bugs. Most of you get notifications for failed builds, so you'll see failed runbooks. If you use a cron job on a random server, the server could go down, the job could get deleted, or the reports of failure could go to /dev/null; but nobody's missing their CI/CD build failures.

Running runbooks from CI/CD is a game changer. Most devs will never update a document. Some will update code they run on their laptop. But if it runs from CI/CD, now anyone can run it, and anyone can update it, so people actually do keep it up to date.

Spivak today at 4:22 AM

Infra person here: you will need external monitoring at some point because checking that your site is up all over the world isn't something you want to do in house. Not because you couldn't but because their outages are likely to be uncorrelated with yours—AWS notwithstanding.

You'll have one of these things anyway, and I haven't seen one yet that doesn't let you monitor your cert and send you expiration notices in advance.

superkuh yesterday at 11:43 PM

For corporations, institutions, and for-profits this matters and there's no real good solution.

But for human persons and personal websites HTTP+HTTPS fixes this easily and completely. You get the best of both worlds. Fragile short lifetime pseudo-privacy if you want it (HTTPS) and long term stable access no matter what via HTTP. HTTPS-only does more harm than good. HTTP+HTTPS is far better than either alone.

throw20251220 today at 12:03 AM

TLS certificates… SSL is some old Java anachronism.

> There’s no natural signal back to the operators that the SSL certificate is getting close to expiry.

There is. The notAfter is right there in the certificate itself. Just look at it with openssl x509 -text and set yourself up some alerts… it’s so frustrating having to refute such random BS every time when talking to clients, because some guy on the internet has no idea but blogs about their own inefficiencies.

Furthermore, their autorenew should have been failing loud and clear, everyone should know from metrics or logs… but nobody noticed anything.

deIeted today at 1:51 AM

Nobody to blame but yourselves.

How long did it take for us to get to a "letsencrypt" setup? And exactly 100ms before that existed, you (meaning 90% of you) mocked and derided that very idea.

thecosmicfrog today at 12:57 AM

> the failure mode is the opposite of graceful degradation. It’s not like there’s an increasing percentage of requests that fail as you get closer to the deadline. Instead, in one minute, everything’s working just fine, and in the next minute, every http request fails.

This has given me some interesting food for thought. I wonder how feasible it would be to create a toy webserver that did exactly this (failing an increasing percentage of requests as the deadline approaches)? My thought would be to start failing some requests as the deadline approaches a point where most would consider it "far too late" (e.g. 4 hours before `notAfter`). At this point, start responding to some percentage of requests with a custom HTTP status code (599 for the sake of example).

Probably a lot less useful than just monitoring each webserver endpoint's TLS cert using synthetics, but it's given me an idea for a fun project if nothing else.
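
For what it's worth, a toy sketch of that idea (purely hypothetical: the expiry time is hard-coded rather than read from the certificate, and the 4-hour window and 599 status come straight from the comment):

    # Toy webserver that fails a growing share of requests as a configured
    # "certificate expiry" approaches. NOT_AFTER is a placeholder; a real
    # version would read notAfter from the served certificate.
    import random
    from datetime import datetime, timedelta, timezone
    from http.server import BaseHTTPRequestHandler, HTTPServer

    NOT_AFTER = datetime(2026, 1, 1, tzinfo=timezone.utc)  # placeholder expiry
    WINDOW = timedelta(hours=4)  # start degrading this long before expiry

    class DegradingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            remaining = NOT_AFTER - datetime.now(timezone.utc)
            # Failure probability ramps linearly from 0 (4h out) to 1 (expired).
            p_fail = min(1.0, max(0.0, 1.0 - remaining / WINDOW))
            if random.random() < p_fail:
                self.send_response(599, "Certificate Nearly Expired")
                self.end_headers()
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), DegradingHandler).serve_forever()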
