Doohickey-d • today at 12:53 PM • 2 replies

I'm curious about this, because in my experience (working on smaller services, though) a small number of errors is always there as a "baseline".

Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"

Which makes me think that a small number of random issues, happening even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.

Replies

bobthepanda • today at 2:38 PM

That's why, at that scale, monitoring for 9s (the success rate) matters more than the absolute number of errors. So long as failures degrade gracefully or get retried, it should not be a massive problem.

It does require constant tuning and adjustment though.
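To make the "9s versus absolute errors" point concrete, here is a minimal sketch. The function names `nines` and `breaches_slo` and the 99.9% target are hypothetical, picked only for illustration; a real setup would read these counts from whatever metrics system is in use.

```python
import math


def nines(successes: int, total: int) -> float:
    """Express a success ratio as a count of nines, e.g. 99.9% -> 3.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    failures = total - successes
    if failures == 0:
        return math.inf  # no failures observed in this window
    return -math.log10(failures / total)


def breaches_slo(errors: int, total: int, target: float = 0.999) -> bool:
    """True when the error ratio exceeds the budget implied by the target."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total > 1 - target


# 5,000 errors looks alarming, but out of 10 million requests it is within budget.
big_service_alert = breaches_slo(errors=5_000, total=10_000_000)

# 50 errors looks harmless, but out of 10,000 requests it blows the budget.
small_service_alert = breaches_slo(errors=50, total=10_000)
```

Under these assumptions the large service stays quiet and the small one alerts, even though the large one has 100 times more errors in absolute terms. The target is the part that needs the constant tuning.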

KPGv2 • today at 1:24 PM

Bit flips can happen in consumer-grade RAM, so that tracks (and it's comforting that wayward cosmic rays are a substantial reason for an application's crashes!), but enterprise servers run ECC RAM, which is very resistant to bit flips.

This is why data hoarders who run NASes with lots of space insist on ECC RAM despite it being significantly more expensive: for all intents and purposes, bit flips cannot corrupt data there, because the RAM itself detects and corrects them (at least in the common single-bit case).

I wouldn't expect bit flips to be a significant contributor to enterprise problems.
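For anyone wondering how memory can "detect and correct" a flip, here is a toy sketch using a Hamming(7,4) code. This is an illustration of the principle only, not how any particular DIMM is implemented: ECC memory commonly uses wider codes (for example 8 check bits per 64 data bits), and the names `encode` and `correct` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout by 1-based position: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Fix at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single bit flip at position 5
recovered, flipped_at = correct(stored)
```

Here `recovered` comes back as the original `[1, 0, 1, 1]` and `flipped_at` is 5. The catch is that this toy code silently miscorrects if two bits flip in the same word; codes used in practice typically add an extra parity bit so that double flips are at least detected, though not corrected.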
