My first in-prod corrupted hard drive problem

(blog.pavementlink.ch)

47 points | by r1chk1t 2 days ago

38 comments

  • proactivesvcs 2 days ago

    I'm surprised to have read to the end and found that they're still not performing any hardware monitoring and alerting. SMART may not always surface pre-failure warnings, but when it does, those warnings can usually be trusted.
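
    For anyone curious what that monitoring could look like: smartd (part of smartmontools) already does scheduled checks and mail alerts out of the box. As a rough sketch of the idea only (assuming smartmontools is installed and this runs with root privileges; the device list and alert hook are placeholders):

      import subprocess, time

      DISKS = ["/dev/sda", "/dev/sdb"]   # placeholder device list

      def alert(msg):
          # Swap in the mail/Slack/pager integration of your choice.
          print(f"ALERT: {msg}")

      while True:
          for dev in DISKS:
              # smartctl -H prints the drive's overall-health self-assessment.
              out = subprocess.run(["smartctl", "-H", dev],
                                   capture_output=True, text=True).stdout
              if "PASSED" not in out:
                  alert(f"{dev}: SMART overall-health check did not pass")
          time.sleep(3600)               # re-check hourly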

    • rkagerer a day ago

      Hard Disk Sentinel is really good for this type of thing. The developer is awesome; some years ago, after I asked for some new features, he added code to better support my RAID adapter.

    • jeffbee 2 days ago

      Wasn't it a conclusion of the Google hard drive reliability study that models based on SMART were not useful? I.e. drives with sector reallocations are much more likely to fail than those without, but their failure rate is still something like 15% per year, so what useful thing can you do with that signal?

      • proactivesvcs 2 days ago

        Well, I don't see why you'd want to keep running a drive that's showing warning signs; that's just asking for trouble. But even if one doesn't replace drives based on this data, if you start seeing alerts at the same time your database suffers corruption, that still shows the usefulness of SMART.

        • jeffbee 2 days ago

          Because taking drives out of service for SMART signals would cost a fortune and almost none of those drives were actually going to fail.

        • estimator7292 2 days ago

          N=1, but I had a drive show catastrophic SMART failures once. I figured I'd take the opportunity to tinker with the exposed serial port on the drive's PCB and wiped the SMART values.

          Funny thing was, I didn't actually observe any data loss. I stressed the drive for several days, no errors. It went back in my daily driver for the next 5 years with no failure. It's been 15 years since that happened and the drive still hasn't failed.

          I don't trust SMART anymore.

          • proactivesvcs a day ago

            I've known a fair few drives with uncorrectable sectors or adaptor warnings that continue to work but a lot more that have degraded or just outright failed.

  • Retr0id 2 days ago

    > So how were we able to recover the database and the data inside it? Most of the data was probably still intact, only a few sectors were unreadable. Once those were either restored (rewritten with a strong signal) or remapped by the drive’s firmware, the filesystem and the database engine could read the file end-to-end again. SQL Server pages also have checksums, so if any page came back wrong rather than unreadable, we’d have known. We got lucky: the corruption was at the magnetic-signal level, not at the “platter is scratched” level.

    This doesn't quite seem to follow. As described, neither of the "recovery" methods actually restore lost data. So why weren't any of the SQL pages left in a bad state?

    • benlivengood 2 days ago

      As best as I can tell it was intermittent read failures on some sectors, not permanent failures.

      So if you keep rereading that section of the disk, you eventually get all the data; save it somewhere, write a bunch of new patterns over it, then write the original data back and verify it reads correctly many times.
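
      A rough sketch of that loop (assuming a Linux block device and a known flaky offset, both placeholders here, and noting the rewrite step is destructive if you point it at the wrong place):

        import os, time

        DEV = "/dev/sdX"            # placeholder device
        SECTOR = 4096
        OFFSET = 123456 * SECTOR    # placeholder offset of the flaky region

        def stubborn_read(dev, offset, length, attempts=200):
            # Keep re-reading until the drive finally hands the data back.
            for _ in range(attempts):
                fd = os.open(dev, os.O_RDONLY)
                try:
                    return os.pread(fd, length, offset)
                except OSError:
                    time.sleep(0.5)     # let the drive's own retry logic run
                finally:
                    os.close(fd)
            raise IOError("region never read back cleanly")

        data = stubborn_read(DEV, OFFSET, SECTOR)   # 1. eventually get the data

        fd = os.open(DEV, os.O_RDWR)
        os.pwrite(fd, b"\xff" * SECTOR, OFFSET)     # 2. exercise the sector with a pattern
        os.pwrite(fd, data, OFFSET)                 # 3. put the original data back
        os.fsync(fd)
        os.close(fd)

        # 4. verify it keeps reading back; for a real check use O_DIRECT or drop
        # caches so the reads hit the platter rather than the page cache.
        for _ in range(10):
            assert stubborn_read(DEV, OFFSET, SECTOR, attempts=1) == data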

      I believe the article's analysis about RAID is wrong though; most controllers will start resilvering or just fail a drive once it experiences too many IO errors.

  • pshirshov 2 days ago

    So, you were not using a striped mirror ZFS for a prod database? What could go wrong, yep.

    • sitzkrieg a day ago

      The article describes why they can't use ZFS due to Windows Server requirements. This is the right choice, because ZFS on Windows is completely garbage.

    • r1chk1t 2 days ago

      learned the hard way

      • justinclift 2 days ago

        Yet at the end it still has this:

        > I did some research, and a RAID wouldn’t have saved it either, RAID protects against drive failure, not against silent page corruption that gets faithfully replicated to every mirror.

        That being said, the article has some strong signals of AI writing in it. So it's possible the author isn't really learning well from the experience either. :(

        • pshirshov 2 days ago

          ZFS and ECC do protect against silent page corruption that gets faithfully replicated to every mirror.

          • justinclift 2 days ago

            Yeah, that was my point. The author seems to have gotten things wrong in some fundamental way.

  • prirun a day ago

    My sister has a Windows 10 laptop she used for her accounting business. One day it decided not to boot, saying there was no boot device. I took the laptop home, took the SSD out (Samsung 1TB), put it in an external USB case, plugged it into another Windows laptop, and it showed up in Explorer. Weird.

    I had another brand-new, identical Samsung SSD, so I hooked both the old and new drive up to a Linux laptop (with USB cases) and tried to dd the old drive to the new drive. That mostly worked, but VERY VERY slowly: it would run fast for 5 seconds and then have no activity for 30 seconds. I had a fan blowing on the old drive to keep it cool because it was running very hot.

    The dd copy would eventually fail and then I'd restart it with appropriate iseek and oseek values. I also ran a cmp of the new disk against /dev/zero to verify that it was all zeroes (it was brand new), and that allowed me to use conv=sparse on the dd. The reason for that was to avoid writing to every sector of the new disk; I didn't want to copy sectors from the old drive that had never been accessed (she only used about 250GB of the 1TB).

    It took a couple of days and about 5 restarts to finish the copy, but it did work, and as a precaution, I made another copy of the drive and ran a cmp of the original drive and the 2nd copy (also having to restart cmp several times). Since that compare worked, I knew that all 3 drives had identical content. The new drive worked fine in her laptop and she was mighty glad to see her Windows login screen.
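
    For the curious, here's roughly what that restartable, sparse copy boils down to; a sketch only, with placeholder device names, assuming the target drive is at least as large as the source and already zero-filled (like the brand-new drive above):

      import os

      SRC, DST = "/dev/sdX", "/dev/sdY"   # placeholders: failing source, zeroed target
      BLOCK = 1024 * 1024                 # copy in 1 MiB chunks
      START = 0                           # after a failure, restart from the last printed offset

      src = os.open(SRC, os.O_RDONLY)
      dst = os.open(DST, os.O_WRONLY)
      size = os.lseek(src, 0, os.SEEK_END)

      offset = START
      while offset < size:
          length = min(BLOCK, size - offset)
          try:
              chunk = os.pread(src, length, offset)
          except OSError:
              print(f"read error at offset {offset}, skipping this block")
              offset += length
              continue
          # conv=sparse equivalent: skip all-zero chunks, the target is already zeroed
          if chunk != b"\x00" * len(chunk):
              os.pwrite(dst, chunk, offset)
          offset += length

      os.close(src)
      os.close(dst)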

    The thing that made this work, IMO, is that Linux has a longer timeout for errors than Windows apparently does, especially during the boot sequence. Plus Linux allows adjusting the drive timeout, so if the device is doing error recovery, which is sometimes slow, it gives it time to finish rather than reporting an error.

    One of my theories was that the bad SSD was overheating, but if that was the case, a cold boot should have worked, with the failures only coming later.

    The other theory is that one of the chips on the SSD failed, so the drive was having to use the ECC codes to correct for the missing information, and the correction process was taking longer than Windows boot would tolerate.

    • BenjiWiebe 12 hours ago

      Next time you have a disk where you need to do repeated dd runs over different ranges, or suspect that you might need to, use ddrescue. It tracks which sectors have been recovered (and has lots of useful options).

      You can also get 'partclone' to generate a list (in ddrescue format) of sectors containing data, so you don't need to try to read unused areas of the disk. For the partclone trick to work, the FS does need to be at least somewhat readable.

  • barrkel 2 days ago

    HDD failures don't normally have a software root cause. Treat HDD failures as a certainty. It's just a matter of time.

  • jtchang 2 days ago

    Confused as to the actual root cause. Don't all hard drives provide SMART diagnostics these days? Was it really bad sectors?

    • alternatetwo 21 hours ago

      I've had a disk silently corrupt data while reading it, without the SMART data showing any reallocated sectors at first. At some point it did, and Seagate was able to recover many files from it.

    • 2 days ago
      [deleted]
    • r1chk1t 2 days ago

      Yes, there were bad sectors in the SMART diagnostics.

  • Felger 2 days ago

    Hi, I believe you are quite new to workstation/hardware admin. Lots of things to say here (not a native English speaker, so basic style, sorry for that):

    Disk errors logged in the system event log come from the I/O layer: the low-level class driver (msahci.sys) / filter drivers. See Windows Storage Driver Architecture: https://learn.microsoft.com/en-us/windows-hardware/drivers/s...

    A disk error of this type showing up in the event log must immediately be treated as an actual disk issue. This is a low-level problem, below the filesystem and the applications/services. It seems that here the .mdf/.ldf files of your SQL database sat on one or more bad sectors on the disk surface.

    Your disk seems to be the only one in the system, so the first thing to do is check its SMART status, for example with CrystalDiskInfo (the most widely used, user-friendly, free, portable Windows tool).

    It would very probably have shown a warning state for the internal disk, with one or more flagged sectors (judging by the quantity of disk error entries in your log) in attribute C5 "Current Pending Sector Count", and probably some in attribute 05 "Reallocated Sector Count" and/or attribute C4 "Reallocation Event Count".
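
    A quick way to eyeball those three attributes from a script, as a sketch (assuming smartmontools is installed, root privileges, and that /dev/sda is the suspect disk; CrystalDiskInfo shows the same raw values in its GUI):

      import subprocess

      # Attribute IDs in decimal: 5 = 0x05, 196 = 0xC4, 197 = 0xC5
      WATCH = {"5": "Reallocated_Sector_Ct",
               "196": "Reallocated_Event_Count",
               "197": "Current_Pending_Sector"}

      out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                           capture_output=True, text=True).stdout

      for line in out.splitlines():
          fields = line.split()
          if fields and fields[0] in WATCH:
              name, raw = fields[1], fields[-1]   # RAW_VALUE is the last column
              if raw != "0":
                  print(f"warning: {name} raw value is {raw}")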

    The second thing to do is to back up your data as fast as possible. In your case, with an MS SQL database, trying to dump/back it up first was the right move. Sadly (speaking from professional data-recovery experience), a weak surface or a failing head stack assembly on a traditional HDD from most vendors has more trouble reading a sector correctly than writing it.

    If the dump/backup fails, the second choice would have been a sector-by-sector dump of the whole disk, either with an online tool (run from the OS) capable of reading sectors from the boot disk (I haven't checked whether HDD Raw Copy Tool 2.6 supports that), or with an offline solution like Clonezilla, Acronis True Image, AOMEI Backupper, etc. But an offline solution means taking the computer and the service offline...

    I didn't quite understand whether you had an actual backup of the data or an image of the whole disk. Considering the critical usage of this station, you should have both running: daily (or more frequent) data backups plus an up-to-date disk image, whatever the type of disk (HDD/SSD). And a spare, identical computer.

    As for repairing HDD "weak sectors" (meaning current pending sectors), it is indeed possible, often with complete data recovery. If not, the sector will be left as is, or may be remapped once it is overwritten with zeros (it will then move from the current-pending count to the reallocated-sector count).

    Hard Disk Sentinel Pro has such features (Disk Repair, Quick Fix), and it works quite well. The results vary greatly from one type of failure to another, and from one disk maker to another.

    Note that if SMART shows more than a dozen or so such sectors, the head (amp/preamp) is probably failing, making magnetically weak sectors too difficult to read and/or write. In that case the pending and remapped counts increase with every repair/check pass the tools make, the drive is toast, and it must be replaced ASAP.

    SSDs are a completely different case when it comes to repair.

    An older standalone tool, SpinRite, specialized in this kind of work (careful recovery of data), but it is veeeeery slow.

    On the relevance of RAID: fortunately, this is an expected failure mode, as most SATA disks tend to suffer HSA failure before they stop initializing altogether. A RAID 1 mirror would have protected you, since this kind of defect does not get mirrored across the two disks.

    The RAID controller (a true hardware controller like LSI/Avago or Microsemi, or even fake RAID like Intel RST/VROC) maintains data integrity across the array's disks. The defective disk will raise bad blocks (which get marked in the metadata of the RAID volume), but the other disks are fine and the data can still be read safely. If too many errors are reported on a disk (in fact very few on most controllers), it will be flagged as failed and dropped from the array.

    • gruez 2 days ago

      >Disk errors logged in the system event log come from the I/O layer: the low-level class driver (msahci.sys) / filter drivers. See Windows Storage Driver Architecture: https://learn.microsoft.com/en-us/windows-hardware/drivers/s...

      What filters in the event log would you apply to find such errors?

      • 2 days ago
        [deleted]
    • jwrallie 2 days ago

      If taking it offline is not a concern, I would try a low level backup with ddrescue while booting from external media as soon as possible.

      Keeping the system running from a disk showing read issues could trigger the loss of more data, and one could always back up the SQL data from the backup image later.

    • r1chk1t 2 days ago

      Thank you for all that, learned a lot!

  • pixel_popping 2 days ago

    I feel the pain OP.

    Over the last decade I've run hundreds of servers, if not thousands, and I entirely stopped using hard drives; now it's solely SSD/NVMe, where the failure rate in practice is dramatically lower. I've had my fair share of middle-of-the-night runs because websites were offline or whatever, only to end up in a hard-drive diagnosis circus.

    IMO, the peace of mind you get is worth the cost. It also allows you to rethink development entirely; a typical example would be that suddenly copying all your node_modules or Rust deps is a great idea with 10 Gbit/s bandwidth and fast drives (yes, I expect people to shit on me for saying this, please give me the counterarguments if you downvote me). Many things change if you have a higher base performance assumption, and storage is relatively cheap as well. I would never advise anyone who wants to run continuously in prod with low friction to get servers with HDDs.

    I get that for some use cases it's not possible, but for the large majority of use cases, it's clearly not the HDD that is the cost burden. A $50 server gets you TBs of SSD; of course, don't go with a VPS or "Cloud" if you intend to change your development based on new performance assumptions. It blows my mind the number of people paying thousands of dollars just to handle what, 100K visitors a day? That fits on a $100 server and a bunch of Kimsufi boxes hosted across the world as a CDN.

    People are overcomplicating infrastructure, big time (which leads to more problems, higher maintenance, security issues and so on).

    • toast0 2 days ago

      > Over the last decade I've run hundreds of servers, if not thousands, and I entirely stopped using hard drives; now it's solely SSD/NVMe, where the failure rate in practice is dramatically lower. I've had my fair share of middle-of-the-night runs because websites were offline or whatever, only to end up in a hard-drive diagnosis circus.

      My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look/script looking), but SSDs fail by disappearing from the bus. The SSDs do fail much less often, but they still fail from time to time and recovery is harder.

      Either way, if your data is important to you/your customers, you really need a backup/recovery plan.

      I dunno about recent pricing, but not so long ago, it felt like spinners had a pretty high price floor and SSDs didn't... If you don't need a lot of space, you could find a small SSD that was still around the same $/GB as a medium sized SSD, but for spinners, there's a floor in dollars and space. So if you don't need a lot of space, you save money with an SSD and get better perf for free... If you need a lot of space and not a lot of perf, big spinners are more attainable than big SSDs.

      • ryandrake 2 days ago

        > My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look/script looking), but SSDs fail by disappearing from the bus. The SSDs do fail much less often, but they still fail from time to time and recovery is harder.

        I'm not a pro, just a smalltime dork with a homelab. I use cheap WD HDDs on my NAS system connected to an LSI hardware RAID controller. I'll boast that I have a 100% record so far of preventing downtime and data loss by simply listening for the controller's audible alarm and swapping drives right away (I keep brand new spares). I also have offline backups, but have so far never needed them. Not sure how this would change if I moved to SSDs.

        • adastra22 a day ago

          I have a homelab with 24 disks (2x 12-drive raidz3 pools). In the past 15 years of operation I have replaced many drives. Usually my experience matches yours, but there was one time I got a simultaneous 3-disk failure. One disk failed, then when I replaced it and did a scrub, two more from the same pool failed in quick succession. I had to scramble to get more spare drives, and didn't sleep much while I waited for the pools to finish the scrub.

          Coordinated failure is a thing, unfortunately. Drives bought together tend to fail together.

        • Felger 2 days ago

          Well, SAS disks tend to go into a failed state immediately or very quickly, most of the time without first going through a warning state.

          SATA disks are indeed generally more predictable failure-wise. Most issues are related to a failing head stack assembly; rarely, platter demagnetization on some disks (Toshiba laptop drives).

          Other failures are usually related to a friggin' manufacturer firmware issue from Dell, HP, or Lenovo.

      • pixel_popping 2 days ago

        Agree with the diagnostic part.

        > Either way, if your data is important to you/your customers, you really need a backup/recovery plan.

        You'd be surprised at how many devs/companies walk on eggshells all the time (praying that the fatal moment never arrives) because they aren't "brave" enough to set up a proper backup system, which is often only a few minutes or hours of work.

    • Retr0id 2 days ago

      It is quite remarkable how quickly a modern SSD can scan over TBs of data, I'm less afraid of O(n) queries than I used to be.

  • louwrentius 2 days ago

    > This disk was probably dying. I did some research, and a RAID wouldn’t have saved it either, RAID protects against drive failure, not against silent page corruption that gets faithfully replicated to every mirror.

    I dispute this was a 'silent' drive error as many systems reported read errors. Silent data corruption on hard drives is extremely rare, due to the tons of checksums used on all data. Maybe I'm wrong but I bet there are read errors on the drive in the appropriate system logs.

    I feel that people confuse regular 'bad blocks' with 'silent data corruption' and there is a huge difference[0].

    [0]: https://louwrentius.com/what-home-nas-builders-should-unders...

    • phoronixrly 2 days ago

      Agreed on the error not being silent. It's also incorrect that RAID can't catch silent errors; it depends on the implementation. In Linux there's lvmraid, which has the option to enforce integrity. There's also ZFS, which, on top of everything else, has RAID functionality and integrity enforcement.

  • JoheyDev888 a day ago

    [dead]

  • codevark 2 days ago

    [dead]