Hacker News

When etcd crashes, check your disks first

37 points by _ananos_ today at 7:18 AM | 13 comments

Comments

watt today at 6:17 PM

Sounds to me like etcd has data handoff consistency issues and is just flying by the seat of its pants: in other words, the moment the IO subsystem hiccups, it just shits the bed?

AntonFriberg today at 3:03 PM

Came across this at work as well during the early days of Kubernetes. At that time all VM storage was backed entirely by NFS or network-attached iSCSI; there was no local disk at all. We noticed intermittent issues that caused the kube API to stop responding, but nothing serious.

Then all of a sudden there was a longer-lasting outage where etcd did not recover on its own. The kube API buffered requests but eventually crashed due to OOM.

The issue came down to the Kubernetes distro we picked (Rancher), which runs a separate cluster in front of the deployed ones for the control plane. Large changes happening in the underlying clusters needed to be sent to the control plane. Once we hit the latency threshold, that sync started failing at regular intervals, the control plane drifted further and further, and it needed more and more changes at startup, until it could not recover at all.

Solving it took some time and confidence: manually cleaning up large amounts of unused data in the underlying etcd instances so that they no longer caused upstream errors.

It was only later, during the post-mortem investigation, that I understood the Raft algorithm and the storage latency issue. Convincing the company to install local disks took some time, but by setting up robust monitoring I could correlate kube API issues directly with disk latency. Fun times!

The requirements are well documented nowadays! https://etcd.io/docs/v3.1/op-guide/performance/
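
If anyone wants to reproduce that kind of correlation before committing to new hardware, a quick fsync latency probe against the etcd data volume goes a long way. Here is a minimal sketch (assuming Python 3 on the node; the path is just a placeholder for a file on the same disk as the etcd data dir) that roughly mimics etcd's WAL append-then-fsync pattern:

    # Hypothetical probe, not part of etcd: time small append+fsync cycles,
    # similar in shape to etcd's WAL write path, against a file that lives
    # on the same volume as the etcd data dir.
    import os
    import statistics
    import sys
    import time

    def probe(path, samples=500, record_size=2048):
        data = os.urandom(record_size)
        latencies_ms = []
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            for _ in range(samples):
                os.write(fd, data)
                start = time.perf_counter()
                os.fsync(fd)  # the call etcd's WAL waits on for every commit
                latencies_ms.append((time.perf_counter() - start) * 1000)
        finally:
            os.close(fd)
            os.unlink(path)
        latencies_ms.sort()
        p99 = latencies_ms[int(len(latencies_ms) * 0.99) - 1]
        print("fsync median %.2f ms, p99 %.2f ms"
              % (statistics.median(latencies_ms), p99))

    if __name__ == "__main__":
        # point this at a file on the same volume as the etcd data dir
        probe(sys.argv[1] if len(sys.argv) > 1 else "./fsync-probe.tmp")

As I recall, the etcd guidance is that the 99th-percentile WAL fsync latency should stay under roughly 10 ms, which is hard to hold on busy network-attached storage.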

kg today at 10:09 AM

> etcd is a strongly consistent, distributed key-value store, and that consistency comes at a cost: it is extraordinarily sensitive to I/O latency. etcd uses a write-ahead log and relies on fsync calls completing within tight time windows. When storage is slow, even intermittently, etcd starts missing its internal heartbeat and election deadlines. Leader elections fail. The cluster loses quorum. Pods that depend on the API server start dying.

This seems REALLY bad for reliability? I guess the idea is that it's better to have things not respond to requests than to lose data, but the outcome described in the article is pretty nasty.

It seems like the solution they arrived at was to "fix" this at the filesystem level by making fsync no longer deliver reliability, which seems like a pretty clumsy solution. I'm surprised they didn't find some way to make etcd more tolerant of slow storage. I'd be wary of turning off filesystem-level reliability, at the risk of later running postgres or something on the same system and experiencing data loss, when what I wanted was just for kubernetes or whatever to stop falling over.
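
To make the failure mode concrete to myself, here's a toy sketch of why that happens (not etcd's actual code, just an illustration assuming etcd's default --heartbeat-interval=100ms and --election-timeout=1000ms): if the leader has to wait on a WAL fsync before it can do anything else, a single storage stall longer than the election timeout looks, from the followers' side, like a dead leader.

    # Toy model only; real etcd/Raft is more involved.
    HEARTBEAT_INTERVAL_MS = 100   # etcd default --heartbeat-interval
    ELECTION_TIMEOUT_MS = 1000    # etcd default --election-timeout

    def leader_loop(fsync_latencies_ms):
        """Leader that must fsync each WAL entry before sending the next heartbeat."""
        clock_ms = 0.0
        last_heartbeat_ms = 0.0
        elections = 0
        for latency in fsync_latencies_ms:
            clock_ms += latency  # time spent blocked inside fsync
            if clock_ms - last_heartbeat_ms >= HEARTBEAT_INTERVAL_MS:
                if clock_ms - last_heartbeat_ms > ELECTION_TIMEOUT_MS:
                    # followers saw no heartbeat for longer than the election
                    # timeout, so they start a new election
                    elections += 1
                last_heartbeat_ms = clock_ms
        return elections

    # Healthy local SSD: sub-millisecond fsyncs, heartbeats stay on schedule.
    print(leader_loop([0.5] * 2000))                            # -> 0
    # Same workload with one 1.5 s storage stall: an election gets triggered.
    print(leader_loop([0.5] * 1000 + [1500.0] + [0.5] * 1000))  # -> 1

The toy at least makes the trade-off visible: tolerating slow storage means either stretching those timeouts (slower failover) or letting un-fsynced writes count, which is exactly the durability compromise I'd rather not make system-wide.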
