> The failure was caused by a timing-dependent race condition in hyper’s HTTP/1 connection handling. When the reader was slower and the socket buffer filled, poll_flush returned Poll::Pending, but the dispatch loop discarded that result. Hyper then treated the response as complete and shut down the socket while data remained buffered internally, causing the client to receive an EOF before the full body arrived.
https://github.com/hyperium/hyper/issues/4022
Saved you 3000 words
The detection story here is what's interesting from an SRE perspective. A 200 OK with truncated data and no error logs is about the hardest class of bug to catch with standard monitoring — your error rate is flat, your latency looks normal, and the only signal is a customer saying "my image is broken."
The race condition aspect makes it worse: it only triggers when the reader is slower than the writer, which in production means it's intermittent and load-dependent. It's the kind of thing synthetic monitoring almost never catches, because your test client is usually fast.
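One cheap signal for this class of bug is comparing the declared Content-Length with the bytes that actually arrived before EOF. Here is a rough std-only sketch of that check. The names `BodyCheck` and `check_body` are made up for illustration, it assumes you have the raw HTTP/1.1 response bytes from a sampling probe, and it deliberately gives up on chunked responses:

```rust
/// Result of comparing the declared body length with what arrived.
#[derive(Debug, PartialEq)]
enum BodyCheck {
    Complete,
    Truncated { expected: usize, received: usize },
    /// No usable Content-Length (e.g. chunked encoding), so nothing to compare.
    Unknown,
}

/// Hypothetical helper: `raw` is everything read from the socket until EOF.
fn check_body(raw: &[u8]) -> BodyCheck {
    // The header block ends at the first blank line.
    let Some(split) = raw.windows(4).position(|w| w == b"\r\n\r\n") else {
        return BodyCheck::Unknown;
    };
    let head = String::from_utf8_lossy(&raw[..split]);
    let received = raw.len() - (split + 4);

    // Skip the status line, then look for a parseable Content-Length header.
    let declared = head.lines().skip(1).find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name.trim().eq_ignore_ascii_case("content-length") {
            value.trim().parse::<usize>().ok()
        } else {
            None
        }
    });

    match declared {
        Some(expected) if received < expected => BodyCheck::Truncated { expected, received },
        // Extra bytes (e.g. a pipelined response) are not treated as an error here.
        Some(_) => BodyCheck::Complete,
        None => BodyCheck::Unknown,
    }
}
```

To have any chance of hitting the race, the probe would also need to read slowly on purpose, since a fast test client drains the socket before the buffer fills.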
How does terrible code like this survive so long in such a key piece of infrastructure?
let _ = self.poll_read(cx)?;
let _ = self.poll_write(cx)?;
let _ = self.poll_flush(cx)?;
Surely at the very least a linter should have flagged that the return values aren't handled.
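Worth noting why the compiler stays quiet here: `Poll` is `#[must_use]`, but `let _ =` is the explicit way to say "I'm discarding this on purpose", so the default lint doesn't fire. The `?` does propagate errors; what gets dropped is the `Ready`/`Pending` distinction. As far as I know, Clippy's opt-in `let_underscore_must_use` lint targets this pattern. A minimal std-only sketch of the difference, where `Conn` and both functions are hypothetical stand-ins and not hyper's actual code:

```rust
use std::io;
use std::task::Poll;

/// Hypothetical stand-in for a connection with bytes still waiting to be flushed.
struct Conn {
    buffered: usize,
}

impl Conn {
    /// Pending while data remains buffered, Ready once everything is written.
    fn poll_flush(&mut self) -> Poll<io::Result<()>> {
        if self.buffered > 0 {
            Poll::Pending
        } else {
            Poll::Ready(Ok(()))
        }
    }
}

/// Pattern from the thread: `?` propagates errors, `let _` throws away Pending.
/// Returns whether the caller may treat the response as fully sent.
fn finish_discarding(conn: &mut Conn) -> io::Result<bool> {
    let _ = conn.poll_flush()?;
    Ok(true) // reports "done" even though bytes may still be buffered
}

/// Same call, but the Pending case is handled instead of discarded.
fn finish_checked(conn: &mut Conn) -> io::Result<bool> {
    match conn.poll_flush()? {
        Poll::Ready(()) => Ok(true),
        Poll::Pending => Ok(false), // not flushed yet, so do not shut down
    }
}
```

With `Conn { buffered: 10 }`, `finish_discarding` reports the response as complete while `finish_checked` does not, which is the gap that lets the socket get shut down early.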
Cloudflare does not notice (until a customer complains) that they are sending broken responses at scale? I would have thought they would notice this from sampling and linting a few replies, just in case they did something like Cloudbleed again.