I wrote this after seeing cases where instances were technically “up” but clearly not serving traffic correctly.
The article explores how client-side and server-side load balancing differ in failure detection speed, consistency, and operational complexity.
I’d love input from people who’ve operated service meshes, Envoy/HAProxy setups, or large distributed fleets — particularly around edge cases and scaling tradeoffs.
Hi author, a tangent:
<meta name="viewport" content="width=device-width, initial-scale=1" />
For those of us who need to zoom in on mobile devices. Thanks for writing something that's accessible to someone who's only used Nginx server-side load balancing and didn't know client-side load balancing existed at higher scale.
I don't think you really need sub-millisecond detection to get sub-millisecond service latency. You mainly need to send backup requests, where appropriate, to backup channels, when the main request didn't respond promptly, and your program needs to be ready for the high probability that the original request wins this race anyway. It's more than fine that Client A and Client B have differing opinions about the health of the channel to Server C at a given time, because there really isn't any such thing as the atomic health of Server C anyway. The health of the channel consists of the client, the server, and the network, and the health of AC may or may not impact the channel BC. It's risky to let clients advertise their opinions about backend health to other clients, because that leads to the event where a bad client shoots down a server, or many servers, for every client.
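To make the backup-request idea concrete, here is a minimal sketch in Python with `asyncio`. The function name `hedged_call`, the `hedge_delay` parameter, and the two zero-argument coroutine functions `primary` and `backup` are all made up for illustration; they don't come from the article or from any particular library. It gives the main request a head start, fires the backup only if the main one is slow or has already failed, and takes whichever succeeds first.

```python
import asyncio


async def hedged_call(primary, backup, hedge_delay):
    """Run primary(); if it is slow or fails, also run backup().

    primary and backup are zero-argument coroutine functions (hypothetical
    stand-ins for "send the request over channel X"). hedge_delay is the
    head start, in seconds, given to the primary before hedging.
    """
    tasks = [asyncio.create_task(primary())]
    try:
        # Give the primary a head start; most of the time it wins here.
        done, _ = await asyncio.wait(tasks, timeout=hedge_delay)
        if done and tasks[0].exception() is None:
            return tasks[0].result()

        # Primary is slow or already failed: send the backup request.
        tasks.append(asyncio.create_task(backup()))
        errors = [t.exception() for t in tasks if t.done()]
        pending = {t for t in tasks if not t.done()}

        # Take the first success from whichever request finishes first.
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for t in done:
                if t.exception() is None:
                    return t.result()
                errors.append(t.exception())

        # Every attempt failed; surface the most recent error.
        raise errors[-1]
    finally:
        # Cancel the loser (a no-op for tasks that already finished).
        for t in tasks:
            t.cancel()
```

Note that nothing here shares health state between clients: each caller just reacts to what it observes on its own channel. In real use you would only hedge requests that are safe to send twice, and you would probably cap how often hedging is allowed so the backups don't add meaningful load.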
Modern LBs, like HAProxy, support both active and passive health checks, plus others such as agent checks, where the app itself can adjust the load-balancing behavior. This means the passive checks in your "client scenario" can be done server-side too.
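For anyone unfamiliar with the term, a passive check just means judging a backend by the outcome of real traffic instead of by probes. Roughly, the bookkeeping looks like the sketch below. This is not HAProxy's implementation; the class name `PassiveHealth` and the `max_failures` and `eject_seconds` parameters are invented for illustration, and the same logic could sit in a client library or in a proxy.

```python
import time


class PassiveHealth:
    """Eject a backend after N consecutive failures seen on real requests."""

    def __init__(self, max_failures=3, eject_seconds=5.0, clock=time.monotonic):
        self.max_failures = max_failures    # hypothetical threshold
        self.eject_seconds = eject_seconds  # hypothetical cool-down period
        self.clock = clock                  # injectable so it can be tested
        self._failures = {}
        self._ejected_until = {}

    def record(self, backend, ok):
        """Record the outcome of one real request to backend."""
        if ok:
            # Any success resets the consecutive-failure streak.
            self._failures[backend] = 0
            return
        count = self._failures.get(backend, 0) + 1
        self._failures[backend] = count
        if count >= self.max_failures:
            # Take the backend out of rotation for a cool-down period.
            self._ejected_until[backend] = self.clock() + self.eject_seconds
            self._failures[backend] = 0

    def is_available(self, backend):
        """True unless the backend is currently inside its cool-down."""
        return self.clock() >= self._ejected_until.get(backend, float("-inf"))
```

Call `record(backend, ok)` after each request and consult `is_available(backend)` when picking a server. Whether this state lives in each client or in one proxy is exactly the trade-off being discussed in the thread.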
Also, in HAProxy (that's the one I know), server-side health checks can run at millisecond intervals. I can't remember the minimum, but I think it's 100ms, so in theory you could fail a server within 200-300ms instead of the 15 seconds in your post.
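The 200-300ms figure falls out of simple arithmetic if you assume a server is marked down after three consecutive failed checks. Here is a back-of-the-envelope sketch; the function name is made up, and the model is simplified: it assumes a failed check fails immediately (for example, connection refused) and ignores check timeouts, which would add to both bounds.

```python
def detection_window_ms(interval_ms, fall):
    """Best and worst case time to mark a server down, in milliseconds.

    interval_ms: time between active health checks.
    fall: consecutive failed checks required before the server is marked down.
    Simplified model: failed checks fail instantly, no check timeout.
    """
    # Best case: the server dies just before a check, so the first failed
    # check is immediate and (fall - 1) more intervals must elapse.
    best = (fall - 1) * interval_ms
    # Worst case: the server dies just after a passing check, so a full
    # interval passes before the first of the `fall` failed checks.
    worst = fall * interval_ms
    return best, worst
```

With `interval_ms=100` and `fall=3` this gives a window of 200 to 300 ms. Requiring fewer failures shrinks the window further, at the cost of more false positives from a single dropped check.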