Hacker News

raddan today at 12:26 PM

I seriously doubt that single-bit errors on the scale of OpenAI workloads really matter very much, particularly for a domain that is already noisy.


Replies

PunchyHamster today at 2:09 PM

Till they hit your program memory. We just had a really interesting incident where one of the Ceph nodes didn't fail but started acting erratically, bringing the whole cluster to a crawl, once a failing RAM module had some uncorrectable errors.

And that was caught because we had ECC. If not for that, we'd have been replacing drives, because the metrics made it look like one of the OSDs was slowing to a crawl, and the usual reason for that is a dying drive.
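For anyone wanting to check for the same thing, here is a rough sketch of reading the ECC error counters that the Linux EDAC subsystem exposes under sysfs. It assumes a Linux host with an EDAC driver loaded; the function names `read_edac_counts` and `controllers_with_uncorrectable` are made up for illustration, and this is not what was actually run during the incident above.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the Linux EDAC layout (mc0/ce_count, mc0/ue_count, ...).
    Returns an empty dict when the directory is missing, e.g. no EDAC
    driver loaded or not running on Linux.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def controllers_with_uncorrectable(counts):
    """Names of memory controllers that report at least one uncorrectable error."""
    return [name for name, (_, ue) in counts.items() if ue > 0]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
    print("needs attention:", controllers_with_uncorrectable(counts))
```

A nonzero uncorrected count on a node that merely looks slow is a hint to suspect RAM before swapping drives.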

Of course, the chance of that is pretty damn small, but their scale is also pretty damn big.
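To put the "small chance, big scale" point in numbers, here is a quick back-of-the-envelope sketch. The per-node probability below is a made-up placeholder, not a measured failure rate, and it assumes nodes fail independently; the only thing it shows is how 1 - (1 - p)^N grows with the node count N.

```python
import math


def p_at_least_one(p_per_node, nodes):
    """Probability that at least one of `nodes` independent nodes is hit.

    Computes 1 - (1 - p)^N via log1p/expm1 so it stays accurate for tiny p.
    """
    return -math.expm1(nodes * math.log1p(-p_per_node))


if __name__ == "__main__":
    p = 1e-4  # hypothetical chance of a bad memory event per node per period
    for n in (1, 1_000, 100_000):
        print(f"{n:>7} nodes: {p_at_least_one(p, n):.4f}")
```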

close04 today at 2:28 PM

Random bit flips are their best path to AGI.