Computer data storage - Reading SMART reports: Difference between revisions

Latest revision as of 16:55, 20 April 2024

Intro

SMART is dumber than it pretends to be.

Yes, it has signs that sometimes strongly indicate that a careful person should replace the disk now -- yet large studies have found no consistent predictor.

...still, some of these values are better than nothing.

Platter

The best indicators of bad health are probably:

5 (0x05), Reallocated Sector Count:

Number of sectors that were remapped by the HDD's firmware because it decided the original sectors were no longer viable.

A slowly increasing figure (of any magnitude) likely means a failing surface.

A largeish figure (hundreds to thousands) likely means the same.

A few, or even a few hundred, might be fine, if constant over a long time. (Drives may have a few when sold, though those typically don't count towards this.)

Reallocations only happen when a drive writes to a pending sector, and fails to do so, which means the number of pending sectors is also important to check (see 197 / 0xC5 below).
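
As a rough illustration of the rules of thumb above, here is a minimal Python sketch that classifies a series of raw values for attribute 5 that you logged yourself over time. The function name, the labels, and the `large` threshold of 500 are all assumptions made for this example (the text only says "hundreds to thousands"), so treat it as a starting point rather than a verdict.

```python
from typing import Sequence


def assess_reallocated(history: Sequence[int], large: int = 500) -> str:
    """Classify raw values of attribute 5 (Reallocated Sector Count).

    history: raw counts, oldest first, e.g. one sample per week.
    large:   hypothetical cut-off for a "largeish" count (an assumption).
    """
    if not history:
        return "unknown"
    latest = history[-1]
    if latest == 0:
        return "ok"
    # Any growth over the observed period is suspicious, whatever the size.
    if len(history) > 1 and latest > history[0]:
        return "increasing"
    if latest >= large:
        return "large"
    # Non-zero but constant: might be fine, keep watching.
    return "stable-nonzero"
```

Both "increasing" and "large" correspond to the "likely a failing surface" cases; "stable-nonzero" is the "might be fine" case.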

198 (0xC6), Uncorrectable Sector Count, a.k.a. Offline Uncorrectable:

Apparently counts the number of UNCs (uncorrectable sectors) as they happen.

When this rises together with the reallocated sector count, it is likely the drive is starting to physically fail.

Note that you may not notice present UNCs until you read every sector. If you want to notice these earlier rather than later, you want to occasionally trigger a SMART offline scan (not unlike a RAID scrub).

Not to be confused with 187 / 0xBB, Reported Uncorrect
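
One way to trigger such a scan is smartmontools' `smartctl -t`, which accepts test types including `offline`, `short` and `long`. The sketch below only builds and runs that command line. The helper names are made up for this example, `/dev/sdX` is a placeholder device, and the choice of `long` as the default (the extended self-test, which reads the whole surface) is an assumption about what you want.

```python
import subprocess
from typing import List

TEST_KINDS = {"offline", "short", "long"}


def selftest_command(device: str, kind: str = "long") -> List[str]:
    """Build the smartctl command line that starts a self-test."""
    if kind not in TEST_KINDS:
        raise ValueError(f"unsupported test kind: {kind!r}")
    return ["smartctl", "-t", kind, device]


def start_selftest(device: str, kind: str = "long") -> subprocess.CompletedProcess:
    """Start the test; needs smartmontools installed and usually root."""
    return subprocess.run(
        selftest_command(device, kind),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode/stdout yourself
    )
```

For example, `start_selftest("/dev/sdX")` could be run from a monthly cron job. The test itself runs inside the drive, so the command returns right away and the results show up in a later SMART report.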

Medium warning signals

197 (0xC5), Current Pending Sector Count: Counts sectors that showed a read error

Roughly means "I couldn't read this, even though I tried pretty hard. It could be a mistake when this data was written, it could be that this sector can no longer hold data -- we'll know that when we next write to it. For now I'm remembering it."

At the next write, one of two things happens:

Some read errors turn out to be transient, and those sectors are removed from this count. There are relatively few reasons for purely transient errors, yet they do happen, and they do not indicate failure.

More typically they are verifiably bad sectors that will be remapped, at which point they are also counted in 0x05. You then want to keep watching that count, because it can mean a failing drive.
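
To make the two outcomes concrete, here is a small sketch that compares two snapshots of the raw values of 197 and 5, and estimates how many pending sectors were remapped versus cleared as transient. The `Snapshot` type and function name are invented for this example. It assumes no new pending sectors appeared between the snapshots and that every reallocation came from a pending sector, so the split is only approximate.

```python
from typing import Dict, NamedTuple


class Snapshot(NamedTuple):
    pending: int      # raw value of attribute 197
    reallocated: int  # raw value of attribute 5


def explain_pending_drop(before: Snapshot, after: Snapshot) -> Dict[str, int]:
    """Estimate what happened to sectors that left the pending count."""
    cleared = max(0, before.pending - after.pending)
    newly_reallocated = max(0, after.reallocated - before.reallocated)
    # Pending sectors that ended up remapped (also counted in 0x05).
    remapped = min(cleared, newly_reallocated)
    # The rest left the pending list without a remap: transient errors.
    return {"remapped": remapped, "transient": cleared - remapped}
```

For instance, going from 8 pending / 0 reallocated to 2 pending / 5 reallocated suggests five sectors were remapped and one read error was transient.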

196 (0xC4), Reallocation Event Count: counts the attempts at reallocation - both successful and failed ones. Often this is pretty redundant with 0x05 and 198/0xC6.

Weaker warning signs

187 (0xBB), Reported Uncorrect (a.k.a. UNC) (used only by some vendors): the count of read errors that could not be recovered automatically using ECC (see also 195/0xC3).

These could be soft errors. Often they are not, but there are other fields that are more informative.

Not to be confused with 198/0xC6 (Uncorrectable Sector Count a.k.a. Offline Uncorrectable)

10 (0x0A), Spin Retry Count: Retries necessary to spin up.

If larger than zero, this can point to general mechanical problems (or insufficient power for spinup (verify))

1 (0x01), (Raw) Read Error Rate: Number of times (verify) we had a problem reading data from the physical storage.

Not necessarily sector-related errors, not necessarily uncorrectable, physical, or permanent errors.

...but a high value means the drive is spending more time doing reads, and you certainly want to look at other indicators to get an idea of why - it could be failing.

(Not to be confused with 13, Soft Read Error Rate)

(Is a rate, and seems to be summarized over recent time)

Other problems:

199 (0xC7), UDMA_CRC_Error_Count - number of incorrect transfers over the drive's cable, as noticed by CRC. (The UDMA in the name just refers to the time this was introduced.)

Often signals a bad connection - badly seated plug, corroded plug/socket, cable not up to spec, or whatnot.

Since these are retried, a low count is typically fine. A high count makes it likelier that there is also a real error among them that you may not have noticed, so it should make you wary enough to e.g. look at the cables.

10 (0x0A), Spin Retry Count - if it has trouble spinning up, chances are the motor or mechanics are worn. Make sure you have recent backups.
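
To pull the attributes discussed above out of a report, here is a minimal sketch that parses the attribute table as printed by `smartctl -A` and lists the watched attributes with a non-zero raw value. The tier labels mirror the grouping in this section. The sample text is made-up illustrative data, not output from a real drive, and the regular expression assumes the usual ten-column ATA table with a plain integer at the start of the raw value. Raw values of some attributes are vendor-specific, so a non-zero number is a prompt to look closer, not a diagnosis.

```python
import re
from typing import Dict, List, Tuple

# Attribute ID -> tier, following the grouping in this section.
WATCHED = {
    5: "best indicator", 198: "best indicator",
    197: "medium warning", 196: "medium warning",
    187: "weaker warning", 10: "weaker warning", 1: "weaker warning",
    199: "other (cabling)",
}

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text: str) -> Dict[int, Tuple[str, int]]:
    """Map attribute ID to (name, raw value); other lines are skipped."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs


def nonzero_watched(attrs: Dict[int, Tuple[str, int]]) -> List[Tuple[int, str, int, str]]:
    """Return (id, name, raw, tier) for watched attributes with raw > 0."""
    return [
        (attr_id, name, raw, WATCHED[attr_id])
        for attr_id, (name, raw) in sorted(attrs.items())
        if attr_id in WATCHED and raw > 0
    ]


# Made-up sample in the layout of `smartctl -A` (illustrative only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28541
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for attr_id, name, raw, tier in nonzero_watched(parse_attributes(SAMPLE)):
        print(f"{attr_id:>3} {name}: raw={raw} ({tier})")
```

Logging these raw values periodically is what makes the "is it constant or creeping up?" question answerable later.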

SSD

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

General
