Computer data storage - Reading SMART reports



SMART is less useful than it pretends to be. There are signs that sometimes strongly indicate that a careful person should replace the disk now, yet large studies have found no consistent predictor.

...still, some of these values are better than nothing.



Platter

The best indicators of bad health are probably:

  • 5 (0x05), Reallocated Sector Count:
Number of sectors remapped by the HDD's firmware because it decided the original sectors are no longer viable.
A slowly increasing figure likely means a failing surface.
A largeish figure (hundreds or thousands) likely means the same.
A few, or even a few hundred, may be fine if the count stays constant. (Drives may have a few when sold, though typically those don't count towards this.)
Reallocation only happens when the drive writes to a pending sector and that write fails, so the number of pending sectors is also important to check (see 197 / 0xC5 below).
  • 198 (0xC6), Uncorrectable Sector Count, a.k.a. Offline Uncorrectable:
Apparently counts UNCs as they happen.
When the reallocated sector count is rising as well, the drive is physically failing.
Note that you may not notice present UNCs until you read every sector. If you want to notice these earlier rather than later, occasionally trigger a SMART offline scan (not unlike a RAID scrub).
Not to be confused with 187 / 0xBB, Reported Uncorrect.
  • SMART overall-health self-assessment test result: FAILED (or comparable if not using smartctl)
    • note that FAILED is a lot more informative than PASSED
    • ...because the disk itself reports this, which allows for some risky/misleading optimism (where pessimism is the correct response, as it gives you early warning that lets you save your data)
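
As a minimal sketch of pulling these values out of a report, the snippet below parses attribute-table text in the column layout that `smartctl -A` prints. The sample rows and their numbers are made up for illustration, and the helper name `parse_raw_values` is hypothetical. Raw-value formats vary by vendor, so this keeps only the leading integer of each raw value.

```python
import re

# Made-up sample in the column layout of `smartctl -A`; the values are illustrative only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   031   045   000    Old_age   Always       -       31 (Min/Max 20/45)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_raw_values(report: str) -> dict[int, int]:
    """Map attribute ID -> leading integer of the RAW_VALUE column (hypothetical helper)."""
    raw = {}
    for line in report.splitlines():
        # At most 10 columns; the last one keeps any trailing text such as "(Min/Max ...)".
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header line, or anything that is not an attribute row
        match = re.match(r"\d+", fields[9])
        if match:
            raw[int(fields[0])] = int(match.group())
    return raw


values = parse_raw_values(SAMPLE)
for attr_id, label in [(5, "reallocated"), (197, "pending"), (198, "offline uncorrectable")]:
    print(f"{label}: {values.get(attr_id, 'not reported')}")
```

Keeping a dated copy of these numbers is what makes "constant" versus "slowly increasing" something you can actually check later.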


Strong warning signals

  • 197 (0xC5), Current Pending Sector Count: Counts sectors that showed a read error
Basically "I couldn't read this, even though I tried pretty hard. It could be a mistake when this data was written, it could be that this sector can no longer hold data -- we'll know that when we next write to it. For now I'm remembering it."
At the next write, one of two things happens:
Some read errors turn out to be transient, so the sector is removed from this count. There are relatively few reasons for purely-transient errors, though they do happen, and they do not indicate failure.
More typically they are verifiably-bad sectors that get remapped, at which point they are also counted in 0x05. This usually means a failing drive.
  • 196 (0xC4), Reallocation Event Count: counts the attempts at reallocation - both successful and failed ones. Often this is pretty redundant with 0x05 and 198/0xC6.
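
The pending-to-reallocated bookkeeping described above can be sketched as a comparison of two snapshots of the raw values for 5 and 197. The function name, the dictionary layout, and the wording of the verdicts are assumptions made for illustration; none of this is something smartctl provides.

```python
def explain_change(before: dict[int, int], after: dict[int, int]) -> str:
    """Interpret how raw values of 5 (reallocated) and 197 (pending) moved (hypothetical helper)."""
    realloc_delta = after.get(5, 0) - before.get(5, 0)
    pending_delta = after.get(197, 0) - before.get(197, 0)

    if realloc_delta > 0:
        # Writes to pending sectors failed, so the firmware remapped them.
        return f"{realloc_delta} sector(s) remapped: likely a failing surface"
    if pending_delta < 0:
        # Pending sectors were rewritten successfully and dropped from the count.
        return f"{-pending_delta} pending sector(s) cleared: read errors were transient"
    if pending_delta > 0:
        return f"{pending_delta} new pending sector(s): verdict comes at the next write"
    return "no change"


print(explain_change({5: 0, 197: 3}, {5: 0, 197: 1}))  # pending dropped, nothing remapped
print(explain_change({5: 0, 197: 3}, {5: 2, 197: 1}))  # pending dropped, two remapped
print(explain_change({5: 2, 197: 1}, {5: 2, 197: 4}))  # pending grew
```

Remapping is checked first on purpose: a rising reallocated count is the graver signal, even when some pending sectors cleared in the same interval.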


Weaker warning signs

  • 187 (0xBB), Reported Uncorrect (a.k.a. UNC) (used only by some vendors): the count of read errors that could not be recovered automatically using ECC (see also 195/0xC3).
These could be soft errors. Often they are not, but there are other fields that are more informative.
Not to be confused with 198/0xC6 (Uncorrectable Sector Count a.k.a. Offline Uncorrectable)
  • 10 (0x0A), Spin Retry Count: Retries necessary to spin up.
If larger than zero, this can point to general mechanical problems (or insufficient power for spinup (verify))
  • 1 (0x01), (Raw) Read Error Rate: Number of times (verify) we had a problem reading data from the physical storage.
Not necessarily sector-related errors, not necessarily uncorrectable, physical, or permanent errors.
...but a high value means the drive is spending more time doing reads, and you certainly want to look at other indicators to get an idea of why - it could be failing.
(Not to be confused with 13, Soft Read Error Rate)
(Is a rate, and seems to be summarized over recent time)



Other problems:

  • 199 (0xC7), UDMA_CRC_Error_Count - number of incorrect transfers over the drive's cable, as noticed by CRC. (UDMA just refers to the time this was introduced)
Often signals a bad connection - badly seated plug, corroded plug/socket, cable not up to spec, or whatnot.
Since these are retried, a low count is typically fine. A high count makes it likelier that there were also errors we didn't notice, so it should make you wary enough to look at the cables.
  • 10 (0x0A), Spin Retry Count - if it has trouble spinning up, chances are the motor or mechanics are worn. Make sure you have recent backups.
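
To tie the groupings on this page together, here is a rough triage sketch that sorts nonzero raw values by the tier this page puts them in. The tier labels, the `triage` name, and the "nonzero means worth a look" rule are all assumptions for illustration, and this is a reading aid rather than a failure predictor. Attribute 1 is left out because its raw value is a rate rather than a count.

```python
# Tiers follow the grouping on this page (assumed labels, not smartctl output).
TIER_ORDER = ["best indicator", "strong warning", "weaker warning", "other (check cabling)"]
TIERS = {
    5: "best indicator",
    198: "best indicator",
    197: "strong warning",
    196: "strong warning",
    187: "weaker warning",
    10: "weaker warning",
    199: "other (check cabling)",
}


def triage(raw_values: dict[int, int]) -> list[tuple[int, int, str]]:
    """Return (attribute ID, raw value, tier) for nonzero known attributes, gravest tier first."""
    hits = [
        (attr_id, raw, TIERS[attr_id])
        for attr_id, raw in raw_values.items()
        if attr_id in TIERS and raw > 0
    ]
    # Sort by tier severity, then by attribute ID for a stable, readable order.
    return sorted(hits, key=lambda hit: (TIER_ORDER.index(hit[2]), hit[0]))


# Made-up raw values: 194 (temperature) is ignored because it is not in TIERS.
for attr_id, raw, tier in triage({5: 0, 197: 2, 199: 14, 10: 1, 194: 31}):
    print(f"{attr_id:>3}  raw={raw:<4} {tier}")
```

A nonzero count in a lower tier is a prompt to look closer (cables, power), not a verdict on the disk.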




SSD