
AntonFriberg · today at 3:03 PM

Came across this at work as well during the early days of Kubernetes. At the time, all VM storage was backed entirely by NFS or network-attached iSCSI; there was no local disk at all. We noticed intermittent issues that caused the kube API to stop responding, but nothing serious.

Then all of a sudden there was a longer-lasting outage where etcd did not recover on its own. The kube API buffered requests but eventually crashed due to OOM.

The issue came down to the Kubernetes distro we had picked (Rancher), which runs a separate cluster in front of the deployed ones to act as the control plane. Large changes in the underlying clusters had to be propagated to that control-plane cluster. Once we hit the latency threshold, propagation started failing at regular intervals; the control plane drifted further and further out of sync and needed ever more changes at startup, until it could not recover at all.

Solving it took some time and confidence: manually cleaning up large amounts of unused data in the underlying etcd instances so that it no longer caused upstream errors.
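
For anyone facing something similar, here is a minimal sketch of that kind of cleanup using the etcd v3 Go client. The endpoint and the "compact everything before the current revision" policy are illustrative assumptions, not the exact steps we ran:

    // Sketch: compact old revisions, defragment to reclaim disk space,
    // and clear any alarms raised while the DB was oversized.
    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        endpoint := "https://etcd-0:2379" // hypothetical endpoint
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{endpoint},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        // Find the current revision, then compact away the older ones.
        status, err := cli.Status(ctx, endpoint)
        if err != nil {
            log.Fatal(err)
        }
        if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
            log.Fatal(err)
        }

        // Compaction only frees revisions logically; defragment
        // actually reclaims the space on disk.
        if _, err := cli.Defragment(ctx, endpoint); err != nil {
            log.Fatal(err)
        }

        // An empty AlarmMember disarms all alarms (e.g. NOSPACE).
        if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
            log.Fatal(err)
        }
        fmt.Println("compacted, defragmented, alarms cleared")
    }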

It was only later, during the post-mortem investigation, that I understood the Raft algorithm and the storage latency issue. Convincing the company to install local disks took some time, but by setting up robust monitoring I could correlate kube API issues with disk latency. Fun times!
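
A crude probe of the kind that makes the correlation visible (my illustration, not our actual tooling): etcd fsyncs its WAL on every write, so the fsync latency of the backing disk is roughly what etcd reports in its etcd_disk_wal_fsync_duration_seconds metric. You can measure it directly:

    // Time 500 small write+fsync cycles on the current directory's
    // filesystem and print the p50/p99 latency.
    package main

    import (
        "fmt"
        "os"
        "sort"
        "time"
    )

    func main() {
        f, err := os.CreateTemp(".", "fsync-probe-*")
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())
        defer f.Close()

        buf := make([]byte, 2048) // roughly a small WAL entry
        lat := make([]time.Duration, 0, 500)
        for i := 0; i < 500; i++ {
            if _, err := f.Write(buf); err != nil {
                panic(err)
            }
            start := time.Now()
            if err := f.Sync(); err != nil { // fsync, as etcd's WAL does
                panic(err)
            }
            lat = append(lat, time.Since(start))
        }
        sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
        fmt.Printf("p50=%v p99=%v\n", lat[len(lat)/2], lat[len(lat)*99/100])
    }

Run it on network-backed storage and then on a local disk and the difference tends to speak for itself.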

The requirements are well documented nowadays! https://etcd.io/docs/v3.1/op-guide/performance/