Computer data storage - Reading SMART reports
Latest revision as of 16:55, 20 April 2024
Intro
SMART is dumber than it pretends to be.
Yes, it has signs that sometimes strongly indicate that a careful person should replace the disk now -- yet large studies have found no consistent predictor.
...still, some of these values are better than nothing.
Platter
The best indicators of bad health are probably:
- 5 (0x05), Reallocated Sector Count:
- Number of sectors that were remapped by the HDD's firmware because it decided the original sectors are no longer viable
- A slowly increasing figure (of any magnitude) likely means a failing surface
- A largeish figure (hundreds to thousands) likely means the same.
- A few, or even a few hundred, might be fine, if constant over a long time. (drives may have a few when sold, though those typically don't count towards this)
- Reallocations only happen when a drive writes to a pending sector, and fails to do so, which means the number of pending sectors is also important to check (see 197 / 0xC5 below).
- 198 (0xC6), Uncorrectable Sector Count, a.k.a. Offline Uncorrectable:
- Apparently counts the number of UNCs as they happen.
- When this rises together with the reallocated sector count, it is likely the drive is starting to physically fail.
- Note that you may not notice present UNCs until you read every sector. If you want to notice these earlier rather than later, you want to occasionally trigger a SMART offline scan (not unlike a RAID scrub).
- Not to be confused with 187 / 0xBB, Reported Uncorrect
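The reasoning above about attribute 5 (rising is bad, large is bad, small and constant might be fine) can be sketched as a small check over logged readings. This is only an illustration: the function name, the labels, and the `large` cutoff of 500 are assumptions made here, not vendor thresholds or anything SMART itself defines.

```python
def classify_reallocated(history, large=500):
    """Classify raw values of attribute 5 (0x05), oldest reading first.

    `large` is an arbitrary illustrative cutoff, not a vendor threshold.
    """
    if not history:
        raise ValueError("need at least one reading")
    # Any increase between consecutive readings counts as "rising".
    if any(later > earlier for earlier, later in zip(history, history[1:])):
        return "rising"   # likely a failing surface, whatever the magnitude
    if history[-1] >= large:
        return "large"    # likely the same
    if history[-1] > 0:
        return "stable"   # nonzero but constant: might be fine, keep watching
    return "clean"        # no reallocations recorded
```

For example, `classify_reallocated([8, 8, 8])` gives `"stable"`, while `classify_reallocated([8, 9, 12])` gives `"rising"`. The readings themselves would have to come from whatever you use to log SMART values over time.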
Medium warning signals
- 197 (0xC5), Current Pending Sector Count: Counts sectors that showed a read error
- Roughly means "I couldn't read this, even though I tried pretty hard. It could be a mistake when this data was written, it could be that this sector can no longer hold data -- we'll know that when we next write to it. For now I'm remembering it."
- At the next write:
- some read errors turn out to be transient, and are then removed from this count. There are relatively few reasons for purely transient errors, yet they do happen, and then do not indicate failure
- More typically they are verifiably bad sectors that will be remapped; at that point they are also counted in 0x05. You then want to watch that count, because a rising value can mean a failing drive.
- 196 (0xC4), Reallocation Event Count: counts the attempts at reallocation - both successful and failed ones. Often this is pretty redundant with 0x05 and 198/0xC6.
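To watch 197 next to 5, 196 and 198 as described above, you need the raw values side by side. Below is a minimal sketch that pulls them out of an attribute table, assuming the ten-column layout that `smartctl -A` prints for ATA drives. The `SAMPLE` text is made up for illustration (it is not output from a real drive), and `parse_raw_values` is a hypothetical helper name.

```python
import re

# Made-up sample in the column layout of `smartctl -A` (not real drive output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

def parse_raw_values(text):
    """Return {attribute_id: raw_value} for rows that start with a numeric ID."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 whitespace-separated columns.
        if len(fields) >= 10 and fields[0].isdigit():
            # Some raw values carry extra text; keep only the leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                raw[int(fields[0])] = int(match.group())
    return raw

raw = parse_raw_values(SAMPLE)
if raw.get(197, 0) > 0:
    print(f"{raw[197]} pending sectors; watch 5 (0x05) after the next writes")
```

In this sample the drive has pending sectors but no reallocations yet, which is the "we'll know when we next write to it" state.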
Weaker warning signs
- 187 (0xBB), Reported Uncorrect (a.k.a. UNC) (used only by some vendors): the count of read errors that could not be recovered automatically using ECC (see also 195/0xC3).
- These could be soft errors. Often they are not, but there are other fields that are more informative.
- Not to be confused with 198/0xC6 (Uncorrectable Sector Count a.k.a. Offline Uncorrectable)
- 10 (0x0A), Spin Retry Count: Retries necessary to spin up.
- If larger than zero, this can point to general mechanical problems (or insufficient power for spinup (verify))
- 1 (0x01), (Raw) Read Error Rate: Number of times (verify) we had a problem reading data from the physical storage.
- Not necessarily sector-related errors, not necessarily uncorrectable, physical, or permanent errors.
- ...but a high value means the drive is spending more time doing reads, and you certainly want to look at other indicators to get an idea of why - it could be failing.
- (Not to be confused with 13, Soft Read Error Rate)
- (Is a rate, and seems to be summarized over recent time)
Other problems:
- 199 (0xC7), UDMA_CRC_Error_Count - number of incorrect transfers over the drive's cable, as noticed by CRC. (UDMA just refers to the time this was introduced)
- Often signals a bad connection - badly seated plug, corroded plug/socket, cable not up to spec, or whatnot.
- Since these are retried, a low count is typically fine. A high count makes it likelier that there is also a real error among them that you may not have noticed, so it should make you wary enough to e.g. look at the cables.
- 10 (0x0A), Spin Retry Count - if it has trouble spinning up, chances are the motor or mechanics are worn. Make sure you have recent backups.
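The grouping used on this page can be written down as a lookup table, which also makes the decimal/hex pairs easy to double-check. This is a sketch under stated assumptions: the tier names and the `triage` helper are invented here, and "nonzero" is only a cue to go look, not a verdict. Attribute 1 in particular is often nonzero on healthy drives.

```python
# Tiers follow the sections above; the tier labels themselves are made up.
TIERS = {
    "strong": {5: "Reallocated Sector Count", 198: "Uncorrectable Sector Count"},
    "medium": {197: "Current Pending Sector Count", 196: "Reallocation Event Count"},
    "weak": {187: "Reported Uncorrect", 10: "Spin Retry Count", 1: "Read Error Rate"},
    "other": {199: "UDMA CRC Error Count"},
}

def triage(raw):
    """List nonzero attributes from `raw` ({id: raw_value}), strongest tier first."""
    findings = []
    for tier, attrs in TIERS.items():  # dicts keep insertion order
        for attr_id, name in attrs.items():
            value = raw.get(attr_id, 0)
            if value > 0:
                # Show the ID the way this page does, e.g. "197 (0xC5)".
                label = f"{attr_id} (0x{attr_id:02X})"
                findings.append((tier, label, name, value))
    return findings

for finding in triage({5: 0, 197: 8, 10: 1, 199: 3}):
    print(finding)
```

The input dictionary is assumed to hold raw values keyed by decimal attribute ID; attributes a drive does not report are simply treated as zero.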
SSD
✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.