Computer data storage - Reading SMART reports
SMART is less useful than it pretends to be. There are signs that sometimes strongly indicate that a careful person should replace the disk now, yet large studies have found no consistent predictor.
...still, some of these values are better than nothing.
The best indicators of bad health are probably:
- 5 (0x05), Reallocated Sector Count:
- The number of sectors remapped by the drive's firmware because it decided the original sectors are no longer viable
- A slowly increasing figure likely means a failing surface
- A largeish figure (hundreds or thousands) likely means the same.
- A few, or even a few hundred, may be fine, if constant. (drives may have a few when sold, though typically those don't count towards this)
- Reallocations only happen when a drive writes to a pending sector, and fails to do so, so the number of pending sectors is also important to check (see 197 / 0xC5 below).
- 198 (0xC6), Uncorrectable Sector Count, a.k.a. Offline Uncorrectable:
- Apparently counts the number of UNCs (uncorrectable read errors) as they happen.
- When the reallocated sector count is rising alongside this, the drive is physically failing.
- Note that you may not notice present UNCs until you read every sector. If you want to notice these earlier rather than later, you want to occasionally trigger a SMART offline scan (not unlike a RAID scrub).
- Not to be confused with 187 / 0xBB, Reported Uncorrect
- SMART overall-health self-assessment test result: FAILED (or comparable if not using smartctl)
- note that FAILED is a lot more informative than PASSED
- ...because the disk itself reports this, which allows for some risky/misleading optimism (where pessimism is the correct response, as it gives you early warning that lets you save your data)
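To check these two attributes from a script, you can pull the raw values out of the attribute table. The following is a minimal sketch, assuming the column layout that `smartctl -A` usually prints; the sample text and its values are made up, and `parse_raw_values` and `check_best_indicators` are hypothetical helper names, not part of any tool.

```python
import re

# Made-up sample in the column layout that `smartctl -A` usually prints.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# id, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text):
    """Map attribute ID -> leading integer of the RAW_VALUE column."""
    raw = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the header and any non-table lines simply do not match
            raw[int(m.group(1))] = int(m.group(3))
    return raw


def check_best_indicators(raw):
    """Return notes for attributes 5 and 198 when they are nonzero."""
    notes = []
    if raw.get(5, 0) > 0:
        notes.append(f"5 Reallocated Sector Count = {raw[5]}: fine only if it stays constant")
    if raw.get(198, 0) > 0:
        notes.append(f"198 Uncorrectable Sector Count = {raw[198]}")
    return notes


if __name__ == "__main__":
    for note in check_best_indicators(parse_raw_values(SAMPLE)):
        print(note)
```

Some vendors put extra text in the raw column; this sketch only keeps the leading integer, which may or may not be what you want for a given drive.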
Strong warning signals
- 197 (0xC5), Current Pending Sector Count: Counts sectors that showed a read error
- Basically "I couldn't read this, even though I tried pretty hard. It could be a mistake when this data was written, it could be that this sector can no longer hold data -- we'll know that when we next write to it. For now I'm remembering it."
- At the next write:
- some read errors may turn out to be transient, and are removed from this count. There are relatively few reasons for purely-transient errors, though they do happen, and do not indicate failure
- More typically they are verifiably-bad sectors that will be remapped; at that point they will also be counted in 0x05. This usually means a failing drive.
- 196 (0xC4), Reallocation Event Count: counts the attempts at reallocation - both successful and failed ones. Often this is pretty redundant with 0x05 and 198/0xC6.
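The pending-sector bookkeeping described above can be followed by comparing two snapshots of the raw values. This is a rough sketch under the assumption that a drop in 197 paired with a rise in 5 means remapping, and a drop without such a rise means the error was transient; `interpret_pending_change` is a hypothetical helper, and real drives may update these counters less tidily.

```python
def interpret_pending_change(before, after):
    """Compare two snapshots (dicts of attribute ID -> raw value).

    Returns how many previously pending sectors appear to have been
    remapped (now counted in 5) and how many appear to have been
    transient (dropped from 197 without a matching rise in 5).
    """
    resolved = max(before.get(197, 0) - after.get(197, 0), 0)
    realloc_rise = max(after.get(5, 0) - before.get(5, 0), 0)
    remapped = min(resolved, realloc_rise)
    return {"remapped": remapped, "transient": resolved - remapped}


if __name__ == "__main__":
    earlier = {5: 10, 197: 4}
    later = {5: 13, 197: 0}
    print(interpret_pending_change(earlier, later))
```

In this made-up example, four pending sectors were resolved and the reallocated count rose by three, so three look remapped and one looks transient.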
Weaker warning signs
- 187 (0xBB), Reported Uncorrect (a.k.a. UNC) (used only by some vendors): the count of read errors that could not be recovered automatically using ECC (see also 195/C3).
- These could be soft errors. Often they are not, but there are other fields that are more informative.
- Not to be confused with 198/0xC6 (Uncorrectable Sector Count a.k.a. Offline Uncorrectable)
- 10 (0x0A), Spin Retry Count: Retries necessary to spin up.
- If larger than zero, this can point to general mechanical problems (or insufficient power for spinup (verify))
- 1 (0x01), (Raw) Read Error Rate: Number of times (verify) we had a problem reading data from the physical storage.
- Not necessarily sector-related errors, not necessarily uncorrectable, physical, or permanent errors.
- ...but a high value means the drive is spending more time doing reads, and you certainly want to look at other indicators to get an idea of why - it could be failing.
- (Not to be confused with 13, Soft Read Error Rate)
- (Is a rate, and seems to be summarized over recent time)
- 199 (0xC7), UDMA_CRC_Error_Count - number of incorrect transfers over the drive's cable, as noticed by CRC. (UDMA just refers to the time this was introduced)
- Often signals a bad connection - badly seated plug, corroded plug/socket, cable not up to spec, or whatnot.
- Since these are retried, a low count is typically fine. A high count makes it likelier that there were also errors we didn't notice, so should make you wary enough to look at the cables.
- 10 (0x0A), Spin Retry Count - if it has trouble spinning up, chances are the motor or mechanics are worn. Make sure you have recent backups.
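To keep the rough ranking on this page in one place, you could group nonzero raw values by how seriously to take them. A minimal sketch: the tier names and the `nonzero_by_tier` helper are made up for illustration, and attribute 1 is left out on purpose because its raw value does not necessarily count errors.

```python
# Tiers follow the grouping used on this page; names are illustrative only.
TIERS = {
    "best indicator": (5, 198),
    "strong warning": (197, 196),
    "weaker warning": (187, 10, 199),
}


def nonzero_by_tier(raw):
    """Group nonzero raw values (attribute ID -> raw value) by tier."""
    report = {}
    for tier, ids in TIERS.items():
        hits = {i: raw[i] for i in ids if raw.get(i, 0) > 0}
        if hits:
            report[tier] = hits
    return report


if __name__ == "__main__":
    sample = {5: 0, 197: 2, 198: 0, 199: 7, 10: 0}  # made-up values
    for tier, hits in nonzero_by_tier(sample).items():
        print(tier, hits)
```

A nonzero value is only a reason to look closer, not a verdict: a low 199 count, for example, is typically fine, and a constant 5 may be too.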