logoalt Hacker News

mgsouth11/20/20240 repliesview on HN

My most vivid gRPC experience is from 10 years or so ago, so things have probably changed. We were heavily Go and micro-services. Switched from, IIRC, protobuf over HTTP, to gRPC "as it was meant to be used." Ran into a weird, flaky bug--after a while we'd start getting transaction timeouts. Most stuff would get through, but errors would build and eventually choke everything.

I finally figured out it was a problem with specific pairs of servers. Server A could talk to C, and D, but would timeout talking to B. The gRPC call just... wouldn't.

One good thing is you do have the source to everything. After much digging through amazingly opaque code, it became clear there was a problem with a feature we didn't even need. If there are multiple sub-channels between servers A and B. gRPC will bundle them into one connection. It also provides protocol-level in-flight flow limits, both for individual sub-channels and the combined A-B bundle. It does it by using "credits". Every time a message is sent from A to B it decrements the available credit limit for the sub-channel, and decrements another limit for the bundle as a whole. When the message is processed by the recipient process then the credit is added back to the sub-channel and bundle limits. Out of credits? Then you'll have to wait.

The problem was that failed transactions were not credited back. Failures included processing time-outs. With time-outs the sub-channel would be terminated, so that wasn't a problem. The issue was with the bundle. The protocol spec was (is?) silent as to who owned the credits for the bundle, and who was responsible for crediting them back in failure cases. The gRPC code for Go, at the time, didn't seem to have been written or maintained by Google's most-experienced team (an intern, maybe?), and this was simply dropped. The result was the bundle got clogged, and A and B couldn't talk. Comm-level backpressure wasn't doing us any good (we needed full app-level), so for several years we'd just patch new Go libraries and disable it.