Hacker News

Understanding ZFS Scrubs and Data Integrity

48 points by zdw, last Wednesday at 8:00 PM, 18 comments

Comments

thatcks today at 5:46 AM

The article is correct, but it downplays an important limitation of ZFS scrubs when it talks about how they're different from fsck and chkdsk. As the article says (in different words), ZFS scrubs do not check filesystem objects for correctness and consistency; they only check that the objects have the expected checksum and so have not become corrupted due to disk errors or other problems. Unfortunately, it's possible for ZFS bugs and issues to give you filesystem objects that have problems, and as it stands today ZFS doesn't have anything that either checks or corrects these. Sometimes you find them through incorrect results; sometimes you discover they exist through ZFS assertion failures triggering kernel panics.

(We run ZFS in production and have not been hit by these issues, at least not that we know about. But I know of some historical ZFS bugs in this area and mysterious issues that AFAIK have never been fully diagnosed.)
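To illustrate the distinction between "checksum matches" and "object is consistent", here is a toy sketch. It is not ZFS's on-disk format or code; the names `seal`, `scrub_ok`, and `consistent`, and the JSON "directory" layout, are all hypothetical and exist only for illustration. The point is that a block written by buggy code gets a valid checksum over its bad contents, so a checksum-only pass accepts it.

```python
import hashlib
import json


def seal(obj: dict) -> dict:
    """Serialize an object and store a checksum of the bytes (toy model)."""
    payload = json.dumps(obj, sort_keys=True).encode()
    return {"payload": payload, "checksum": hashlib.sha256(payload).hexdigest()}


def scrub_ok(block: dict) -> bool:
    """Scrub-style check: do the stored bytes still match their checksum?"""
    return hashlib.sha256(block["payload"]).hexdigest() == block["checksum"]


def consistent(block: dict, known_ids: set) -> bool:
    """fsck-style check: does every directory entry point at a real object?"""
    obj = json.loads(block["payload"])
    return all(child in known_ids for child in obj["entries"].values())


# Objects that actually exist in this toy filesystem.
known_ids = {1, 2, 3}

# A directory written by (hypothetical) buggy code: it references object 99,
# which does not exist, but the checksum was computed over those bad contents.
bad_dir = seal({"id": 1, "entries": {"a.txt": 2, "ghost.txt": 99}})

# The same block after a simulated on-disk corruption of its last byte.
flipped = dict(bad_dir, payload=bad_dir["payload"][:-1] + b"X")

print("bad_dir passes scrub:      ", scrub_ok(bad_dir))
print("bad_dir is consistent:     ", consistent(bad_dir, known_ids))
print("flipped block passes scrub:", scrub_ok(flipped))
```

In this model the checksum pass catches the flipped byte but not the dangling entry, which is the gap described above.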

klempner today at 6:05 AM

>HDDs typically have a BER (Bit Error Rate) of 1 in 10^15, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives. Silent corruption is one of those problems you only notice after it has already done damage.

While the advice is sound, this number isn't the right number for this argument.

That 10^15 number is for UREs (unrecoverable read errors), which aren't going to cause silent data corruption -- simple naive RAID-style mirroring/parity will easily recover from a known error of this sort without any filesystem-layer checksumming. The rate for silent errors, where the disk returns wrong data without reporting a failure and checksumming is what catches it, is a couple of orders of magnitude lower.
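For reference, the arithmetic behind the quoted "every 100 TiB" and "3 or 4 full drive reads" figures can be checked with a short sketch. The 30 TB drive capacity is an assumption picked for illustration (the quote does not name a size), and the function names are hypothetical. This only reproduces the quoted numbers for the 1 in 10^15 rate; it says nothing about the lower rate of silent errors.

```python
TIB = 2**40   # bytes in one tebibyte
TB = 10**12   # bytes in one (decimal) terabyte, as used on drive labels

BITS_PER_ERROR = 10**15  # the "1 in 10^15" rate from the quoted passage


def bytes_per_expected_error(bits_per_error: int = BITS_PER_ERROR) -> float:
    """Bytes read, on average, per expected error at the given bit rate."""
    return bits_per_error / 8


def full_reads_per_expected_error(drive_bytes: int,
                                  bits_per_error: int = BITS_PER_ERROR) -> float:
    """How many complete reads of one drive add up to one expected error."""
    return bytes_per_expected_error(bits_per_error) / drive_bytes


# Assumed drive capacity, for illustration only.
ASSUMED_DRIVE_BYTES = 30 * TB

tib_per_error = bytes_per_expected_error() / TIB
reads_per_error = full_reads_per_expected_error(ASSUMED_DRIVE_BYTES)

print(f"Data read per expected error: {tib_per_error:.1f} TiB")
print(f"Full reads of a 30 TB drive per expected error: {reads_per_error:.2f}")
```

Under these assumptions the result is a little over 100 TiB per expected error and roughly four full reads of the assumed drive, in line with the quoted passage.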

itchingsphynx today at 5:13 AM

>Most systems that include ZFS schedule scrubs once per month. This frequency is appropriate for many environments, but high churn systems may require more frequent scrubs.

Is there a more specific 'rule of thumb' for scrub frequency? What variables should one consider?
