Hacker News

My first in-prod corrupted hard drive problem

30 points | by r1chk1t | today at 7:35 PM | 23 comments

Comments

proactivesvcs | today at 8:37 PM

I'm surprised to have read to the end and found that they're still not performing any hardware monitoring and alerting. SMART may not always surface pre-failure warnings, but when it does, they can usually be trusted.

Retr0id | today at 7:50 PM

> So how were we able to recover the database and the data inside it? Most of the data was probably still intact, only a few sectors were unreadable. Once those were either restored (rewritten with a strong signal) or remapped by the drive’s firmware, the filesystem and the database engine could read the file end-to-end again. SQL Server pages also have checksums, so if any page came back wrong rather than unreadable, we’d have known. We got lucky: the corruption was at the magnetic-signal level, not at the “platter is scratched” level.

This doesn't quite seem to follow. As described, neither of the "recovery" methods actually restore lost data. So why weren't any of the SQL pages left in a bad state?
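For context on the mechanism the quoted paragraph relies on: each SQL Server page stores a checksum computed when the page was last written, so a page read back with different bits fails verification instead of being silently accepted. The sketch below only illustrates the idea; SQL Server's actual page-checksum algorithm is internal to the engine, and CRC32 is a stand-in here.

```python
import zlib

PAGE_SIZE = 8192  # SQL Server pages are 8 KiB

def checksum_page(page: bytes) -> int:
    # Stand-in checksum: SQL Server's real algorithm is internal to the engine.
    return zlib.crc32(page)

def verify(page: bytes, stored: int) -> str:
    """Return 'ok' if the page matches the checksum stored at write time."""
    return "ok" if checksum_page(page) == stored else "checksum mismatch"

page = bytes(PAGE_SIZE)               # a blank 8 KiB page
stored = checksum_page(page)          # checksum recorded at write time
print(verify(page, stored))                   # -> ok
print(verify(b"\x01" + page[1:], stored))     # -> checksum mismatch
```

Under this model, sectors that "came back wrong" would have failed page verification, while sectors that recovered bit-identical pass cleanly, which is consistent with the "lucky" outcome described in the article.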

jtchang | today at 7:54 PM

Confused as to the actual root cause. Don't all hard drives provide SMART diagnostics these days? Was it really bad sectors?

barrkel | today at 9:57 PM

HDD failures don't normally have a software root cause. Treat HDD failures as a certainty. It's just a matter of time.

Felger | today at 8:39 PM

Hi, I believe you are quite new to workstation/hardware administration. There's a lot to say here (I'm not a native English speaker, so apologies for the plain style):

Disk errors logged in the system event log come from the I/O layer: the low-level class driver (msahci.sys) and filter drivers. See the Windows Storage Driver Architecture documentation: https://learn.microsoft.com/en-us/windows-hardware/drivers/s...

A disk error of this type in the event log must immediately be treated as an actual disk issue: it originates below the filesystem and the application/service layer. It seems the .mdf/.ldf files of your SQL database sat on one or more bad sectors on the disk surface.

Your disk seems to be the only one in the system, so the first thing to do is check its SMART status, for example with CrystalDiskInfo (the most widely used, user-friendly, free portable Windows tool).

It would very probably have shown a warning state for the internal disk, judging by the quantity of disk error entries in your log: one or more counts in attribute C5 ("Current Pending Sector Count"), and probably some in attribute 05 ("Reallocated Sector Count") and/or attribute C4 ("Reallocation Event Count").
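For readers who want to check those attributes without a GUI, here is a minimal, hypothetical sketch of parsing `smartctl -A` output for the attributes named above (05, C4, C5). It assumes the common `ID# ATTRIBUTE_NAME ... RAW_VALUE` table layout, which can vary by drive and smartctl version, so treat it as illustrative only.

```python
# Attribute IDs from the comment: 0x05 = 5, 0xC4 = 196, 0xC5 = 197
WATCHED = {5: "Reallocated_Sector_Ct",
           196: "Reallocated_Event_Count",
           197: "Current_Pending_Sector"}

def parse_attributes(smartctl_output: str) -> dict:
    """Parse 'smartctl -A' table lines into {attribute_id: raw_value}.
    Assumes the usual 'ID# NAME ... RAW_VALUE' column layout."""
    attrs = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[0].isdigit() and parts[-1].isdigit():
            attrs[int(parts[0])] = int(parts[-1])
    return attrs

def smart_warnings(attrs: dict) -> list:
    """Flag any watched attribute with a non-zero raw value."""
    return [f"{name} = {attrs[aid]}"
            for aid, name in WATCHED.items()
            if attrs.get(aid, 0) > 0]

sample = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 005 Pre-fail Always - 12
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 8
"""
print(smart_warnings(parse_attributes(sample)))
```

Anything non-zero in those three counters is the warning state described above.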

The second thing to do is back up your data as fast as possible. In your case, with an MS SQL database, trying to dump/back it up first was the right move. Sadly (speaking from professional data-recovery experience), a weak surface or failing head stack assembly on a traditional HDD from most vendors has more trouble reading a sector correctly than writing it.

If the dump/backup fails, the next option would have been a sector-by-sector dump of the whole disk, either with online (in-OS) software capable of reading sectors from the boot disk (I haven't checked whether HDD Raw Copy Tool 2.6 supports that), or with an offline solution like Clonezilla, Acronis True Image, AOMEI Backupper, etc. But an offline solution means taking the computer, and the service, offline...
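The sector-by-sector approach these tools take boils down to "read what you can, zero-fill and log what you can't" (the idea behind GNU ddrescue). Here is a minimal, illustrative version over file-like objects, with a simulated failing device; real tools add retries, reverse passes, and a map file:

```python
import io

SECTOR = 512

def rescue_copy(src, dst, total_sectors):
    """Copy sector by sector; unreadable sectors are zero-filled and logged
    rather than aborting the whole copy."""
    bad = []
    for i in range(total_sectors):
        src.seek(i * SECTOR)
        try:
            data = src.read(SECTOR)
        except OSError:
            data = b"\x00" * SECTOR   # placeholder for the unreadable sector
            bad.append(i)
        dst.seek(i * SECTOR)
        dst.write(data)
    return bad

class FlakyDisk(io.BytesIO):
    """Simulated device whose sector 3 always fails to read."""
    BAD_SECTOR = 3
    def read(self, n=-1):
        if self.tell() // SECTOR == self.BAD_SECTOR:
            raise OSError("unreadable sector")
        return super().read(n)

src = FlakyDisk(bytes(range(256)) * 16)   # 4096 bytes = 8 sectors
dst = io.BytesIO()
bad = rescue_copy(src, dst, 8)
print(bad)  # -> [3]
```

The point is that one unreadable sector costs you 512 bytes, not the whole image.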

I didn't quite understand whether you had an actual backup of the data or an image of the whole disk. Considering the critical usage of this station, you should have both running: a daily (or more frequent) data backup plus an up-to-date disk image, whatever the type of disk (HDD/SSD). And a spare, identical computer.

As for repairing HDD "weak sectors" (meaning current pending sectors), it is indeed possible, often with complete data recovery. If not, the sector is left as is, or may be remapped if written with zeros (it then shifts from Current Pending Sector Count to Reallocated Sector Count).

Hard Disk Sentinel Pro has such features (Disk Repair, Quick Fix), and it works quite well. The results vary greatly from one type of failure to another, and from one disk maker to another.

Note that if SMART shows more than a dozen or so such sectors, the head (amp/preamp) is probably failing, making magnetically weak sectors too difficult to read and/or write. In that case the pending and remapped counts increase with every repair/check pass the tools make, the drive is toast, and it must be replaced ASAP.

SSDs are a completely different case for repair.

An older standalone tool, SpinRite, specialized in exactly this (accurate recovery of data), but it was very, very slow.

On the relevance of RAID: fortunately, this is an expected failure mode, as most SATA disks suffer HSA failure before refusing to initialize at all. A RAID 1 mirror would have protected you, barring the same defect mirrored across both disks.

The RAID controller (a true hardware controller like LSI/Avago or Microsemi, or even fake RAID like Intel RST/VROC) maintains data integrity across the array's disks. The defective disk will raise bad blocks (which get marked in the RAID volume's metadata), but the other disks are fine and the data can be read safely. If too many errors are reported on a disk (in fact very few on most controllers), it is labelled as failed and dropped from the array.
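The read path described here — serve the sector from a healthy mirror and record a bad block against the failing member — can be sketched roughly as follows. The function names and the simulated disks are illustrative, not any controller's actual API:

```python
def mirrored_read(disks, sector, bad_log):
    """RAID-1-style read: return the sector from the first mirror that can
    read it; any mirror that fails gets a bad block recorded against it
    (the 'marked in the RAID volume metadata' step)."""
    for idx, disk in enumerate(disks):
        try:
            return disk(sector)
        except OSError:
            bad_log.append((idx, sector))  # mark bad block for this member
    raise OSError(f"sector {sector} unreadable on every mirror")

# Simulated mirror members: disk 0 has a media error at sector 7.
def disk0(sector):
    if sector == 7:
        raise OSError("media error")
    return f"disk0:{sector}"

def disk1(sector):
    return f"disk1:{sector}"

bad_log = []
print(mirrored_read([disk0, disk1], 7, bad_log))  # -> disk1:7
print(bad_log)                                    # -> [(0, 7)]
```

A real controller would also count errors per member and fail the disk out of the array once the count crosses its threshold.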

pshirshov | today at 8:28 PM

So, you were not using a striped mirror ZFS for a prod database? What could go wrong, yep.

louwrentius | today at 10:02 PM

> This disk was probably dying. I did some research, and a RAID wouldn’t have saved it either, RAID protects against drive failure, not against silent page corruption that gets faithfully replicated to every mirror.

I dispute that this was a 'silent' drive error, as many systems reported read errors. Silent data corruption on hard drives is extremely rare, thanks to the many checksums applied to all data. Maybe I'm wrong, but I bet there are read errors from this drive in the appropriate system logs.

I feel that people confuse regular 'bad blocks' with 'silent data corruption' and there is a huge difference[0].

[0]: https://louwrentius.com/what-home-nas-builders-should-unders...
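The distinction the commenter draws can be illustrated concretely: a bad block is *reported* (the read fails with an error), while silent corruption *succeeds* while returning wrong bytes, and only an end-to-end checksum kept above the drive (as in ZFS) catches it. A minimal sketch:

```python
import hashlib

def store(block: bytes):
    """Keep an end-to-end checksum alongside the data (the ZFS idea)."""
    return block, hashlib.sha256(block).digest()

def read_with_verify(block: bytes, digest: bytes) -> bytes:
    # A reported bad block fails loudly at read time; silent corruption
    # returns 'success' with wrong bytes, and only this check notices.
    if hashlib.sha256(block).digest() != digest:
        raise ValueError("silent corruption caught by end-to-end checksum")
    return block

data, digest = store(b"some database page")
assert read_with_verify(data, digest) == b"some database page"
flipped = bytes([data[0] ^ 1]) + data[1:]   # one bit flipped, read "succeeds"
try:
    read_with_verify(flipped, digest)
except ValueError as e:
    print(e)  # -> silent corruption caught by end-to-end checksum
```

A drive throwing read errors into the event log, as in the article, is the first case, not the second.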

pixel_popping | today at 8:04 PM

I feel the pain OP.

Over the last decade I've run hundreds of servers, if not thousands, and I've entirely stopped using hard drives; now it's solely SSD/NVMe, where the failure rate in practice is dramatically lower. I've had my fair share of middle-of-the-night runs because websites were offline or whatever, only to end up in a hard-drive diagnosis circus.

Imo, the peace of mind you get is worth the cost. It also lets you rethink development entirely: a typical example is that copying all node_modules or Rust deps suddenly becomes a great idea with 10 Gbit/s bandwidth and fast drives (yes, I expect people to shit on me for saying this, please give me the counterarguments if you downvote me). Many things change once you have a higher baseline performance assumption, and storage is relatively cheap as well. I would never advise anyone who wants to run continuously in prod with low friction to get servers with HDDs.

I get that for some use cases it's not possible, but for the large majority of use cases it's clearly not the HDD that is the cost burden. $50 servers get you TBs of SSD. Of course, don't go with a VPS or "Cloud" if you intend to change your development based on new performance assumptions. It blows my mind how many people pay thousands of dollars just to handle, what, 100K visitors a day? That fits on a $100 server plus a bunch of Kimsufi boxes hosted across the world as a CDN.

People are overcomplicating infrastructure, big time (which leads to more problems, higher maintenance, security issues and so on).
