Computer hard drives

These are primarily notes
This is probably not going to be complete in any real sense, and exists to contain bits of useful information.

Terms, concepts, and some history

IDE

IDE: Integrated Drive Electronics, a hardware bus used primarily for (parallel) ATA and its protocol. The term predates the ATA standard, and has also been used for some things that were not (initially or at all) part of it.

Informally, IDE now most often refers to the 40-pin connectors, to distinguish those from SATA.

IDE and ATA were informally used interchangeably, and abused enough for the terms to be ambiguous. The now common (but somewhat arbitrary) distinction between SATA and IDE(-as-in-parallel) is convenient enough, but you should avoid the term IDE when you want to refer to actual standards.

EIDE

EIDE: Enhanced IDE, refers partly to early adoption of some features of then-future ATA standards. (The term was also abused at that time for some other ends, apparently mainly for marketing reasons.)

ATA

ATA: AT Attachment, Advanced Technology Attachment. A set of standards and a family of attachments primarily used to connect hard drives, CD drives and such. Its various versions and parts of these standards encompass Parallel ATA, ATAPI, Serial ATA (SATA), and more.


ATA versions and introduced features:

  • pre-ATA standards had varied limits, particularly in the DOS era
  • ATA-1 (1994) - PIO 1 and 2, 28-bit LBA for sizes up to 137GB
  • ATA-2 (1996) - PIO 3 and 4, DMA 1 and 2 (the terms 'EIDE' and 'Fast-ATA' appeared around this time)
  • ATA-3 (1997) - SMART, connector for 2.5" drives
  • ATA-4 (1998) - UDMA 0, 1, 2 ('UDMA/33')
  • ATA-5 (2000) - UDMA 3, 4 ('UDMA/66')
  • ATA-6 (2002) - UDMA 5 ('UDMA/100'), 48-bit LBA for sizes up to 144 PB
  • ATA-7 (2005) - UDMA 6 ('UDMA/133'), SATA 1.0 ('SATA/150')
  • ATA-8 (in progress)
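
The two LBA size limits in the list follow directly from the address width. A minimal sketch of that arithmetic, assuming the classic 512-byte sector (the function name is just for illustration):

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors


def lba_limit_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable with the given number of LBA bits."""
    return (2 ** address_bits) * sector_bytes


for bits in (28, 48):
    size = lba_limit_bytes(bits)
    # Decimal units, which is how the 137GB and 144 PB figures are counted.
    print(f"{bits}-bit LBA: {size} bytes, {size / 1e9:.1f} GB, {size / 1e15:.3f} PB")
```

With 4096-byte sectors the same address widths would reach eight times as far, which is why the sector size is kept as a parameter.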


PATA

PATA: Parallel ATA, a retronym coined after SATA was introduced, to refer to the 40-pin connector (defined in all ATA standards(verify)) that is now also commonly referred to as IDE.

Note: If there are two devices on a single cable, one is called master and the other the slave (jumpered that way, or sometimes implied via cable select). These names are not official, and they do not refer to interaction or control between the drives - both drives operate independently, using their own controllers.

SATA

SATA: Serial ATA, part of ATA-7 and later: the interface now becoming standard, with a smaller plug than PATA and thinner (often red) cables.

SATA-related terms:

  • SATA/150, SATA/300: The current SATA standards
  • SATA II: the committee defining SATA. Because of widespread (ab)use of 'SATA II' to refer to 3Gb/s devices (SATA/300), that committee has been renamed to SATA-IO
  • SATA 1 or 1.0 and SATA 2 or 2.0 are non-official names for SATA/150 and SATA/300
  • eSATA: Connector for SATA connection outside of enclosures. Same pinout, but the connectors are made so that internal cables cannot be used externally (useful because the cables/interface use slightly different electronic and physical specs -- more suitable for external purposes)

Drive / interface features

SMART

S.M.A.R.T.: Self-Monitoring, Analysis and Reporting Technology, meant to monitor drive health and warn about imminent failure.

(Compliance with this standard seems to vary per manufacturer, and probably somewhat per drive design)(verify)

Command queueing

Command queueing refers to reordering queued read and/or write commands so that head position and rotational position are taken into account. In the best case a series of operations finishes faster; in the worst case it makes no difference and the overhead (fairly little for NCQ, while for TCQ it depends somewhat on the implementation/technology) makes things a little slower. The practical value of command queueing therefore depends on how much it gains in the average case and how much it loses in the worst case.

  • TCQ: Tagged Command Queueing: Done by the driver (verify)
  • NCQ: Native Command Queueing: Done by the drive (SATA feature, optional?(verify)).


Somewhat relevant

ATAPI

ATAPI: ATA extension based on SCSI protocol features, that made it more useful for certain (additional) drive types, such as CD and tape drives.


Panasonic / Sony / Mitsumi interface

Panasonic, Sony, and Mitsumi interface: Before ATAPI caught on, these companies created proprietary interfaces (sometimes called AT-BUS and other things), seen most often on old sound cards as two or three connectors that were used to connect these pre-ATAPI drives. Sony used a 34-pin ribbon connector, Panasonic and Mitsumi used 40-pin ribbon connectors (that were potentially confusable with IDE connectors).


ARMD

ARMD: ATAPI Removable Media Device, used to refer to ATAPI drives other than CD/DVD drives, such as tape drives.


Errors and messages

common causes of hard drive related errors

  • controller problems -- try drive in other computers
  • cabling -- try another data cable (and perhaps power cable), and perhaps in another computer
  • failing drive -- look at SMART info (but note this is not always reliable)
  • marginal power supply (note that power rating alone gives no guarantees) -- you could test for this by seeing whether fewer errors are logged when running less hardware, or by using a different power supply.
  • specific driver problems -- try a newer/older driver/kernel to see if there is a difference
  • there are controllers with known limitations, sometimes in specific situations such as in external hard drive cases which may not recognize drives of a certain (large) size. (currently mostly SATA/150 controllers -- because those are the most common outside of motherboards)
  • There are some drives that speak only SATA/300 (informally 'SATA 2') and not /150, even though /300 is supposed to be backwards compatible with SATA/150. In some cases you can jumper the drive to behave like a SATA/150 one, in other cases you may need a SATA/300 controller for it (which by now are fairly common).

In linux

DMA intr status 0x51, error 0x84

Seeing the following in your log:

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

Or, in libata something like:

ata2.00: tag 0 cmd 0xc8 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)


...means high-speed UDMA transfers are failing. These are warnings, not errors, and it doesn't mean anything went bad unless the fallbacks fail - which is rare, and if it happens would be logged fairly immediately after.

If you see only the above pairs of lines, all that happened is that the drive was set to a lower speed (lower UDMA mode, possibly even PIO). Each such step down is also logged immediately after, and this repeats until the operation succeeds.

Most of the time you will see these lines when the bootup process enables DMA transfers on specific drives.


Apparently, the main cause is that there is too much noise in the (parallel IDE) cable for a certain UDMA mode, most of the time because it is a cable not guaranteed for a particular speed, or just a low quality cable.


(Other causes may or may not include incompatible controllers on the same cable and too little power supplied to your hard drive, or a failing hard drive. (verify))
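
The names in braces in those log lines are just the set bits of the two register values. A small sketch that decodes them, assuming the usual ATA status/error register bit layout and using the names as the old IDE driver prints them (only the two error bits seen in the example are named; other error bits are left out rather than guessed):

```python
# Status register bits, most significant first.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Error register: only the bits that appear in the example above.
ERROR_BITS = [
    (0x04, "DriveStatusError"),  # command aborted
    (0x80, "BadCRC"),            # interface CRC error during the transfer
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print("status=0x51:", decode(0x51, STATUS_BITS))
print("error=0x84: ", decode(0x84, ERROR_BITS))
```

The BadCRC bit is the interesting one here: it says the data was damaged on the way over the cable, not on the platter.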

libata messages

EH refers to error handling. When you see it in logs, it doesn't need to refer to an error. It could be that EH is choosing a slower, more basic and robust transfer method, or is handling an interface reset, both of which are generally transparent to apps other than in delays.

In other cases, EH may be verbosely reporting the various details of a drive failing.

EH actually runs fairly frequently, but only logs at all when there is something worth mentioning.


Informational messages, non-fatal error

A drive initializing, e.g. being recognized at bootup, looks something like:

SCSI device sda: 976773168 512-byte hdwr sectors (500108MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back

This is just telling you things about the drive and access mechanisms.
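
As a quick sanity check, the size in parentheses can be reproduced from the sector count. A sketch, assuming the figure is in decimal megabytes:

```python
sectors = 976773168   # sector count from the first log line
sector_bytes = 512    # "512-byte hdwr sectors"

total_bytes = sectors * sector_bytes
megabytes = round(total_bytes / 1_000_000)  # decimal MB, not MiB

print(total_bytes, "bytes is about", megabytes, "MB")
```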


Similarly, a port (re)initializing looks something like:

ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: configured for UDMA/133


If a port or device initialization happens more than once for a device, this probably means error handling kicked in and has just reset it (look for EH complete and possibly a soft resetting port in the logs).

If the reason seems valid (e.g. a device was unplugged and possibly plugged in again), this is not a problem.
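
To spot ports that (re)initialize more than once, you could count the link-up lines per port. A minimal sketch, assuming log lines shaped like the example above (possibly with a timestamp prefix); the function names are made up for illustration:

```python
import re
from collections import Counter

# Matches e.g. "ata3: SATA link up 3.0 Gbps (...)" anywhere in a line.
LINK_UP = re.compile(r"\b(ata\d+): SATA link up")


def count_link_ups(log_lines):
    """Count 'SATA link up' messages per port name (e.g. 'ata3')."""
    counts = Counter()
    for line in log_lines:
        match = LINK_UP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_ports(log_lines):
    """Ports that came up more than once, which suggests a reset happened."""
    return sorted(port for port, n in count_link_ups(log_lines).items() if n > 1)


sample = [
    "ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "ata3.00: configured for UDMA/133",
    "ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "ata3.00: configured for UDMA/133",
]
print(repeated_ports(sample))
```

In practice you would feed it the lines of your kernel log instead of the sample list.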


On drive errors, bad blocks, and more

This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine.

Bad blocks/sectors refers broadly to any problem that leads to areas of the disk not storing or reading data properly. This is usually identified by whatever low-level error checking the drive uses under the covers (something like ECC) and its report of failure to the OS. Applications exist to check for them, usually by reading (and possibly writing) the entire disk and relying on the hardware/OS reporting such errors to them.


A hard error/failure refers to the underlying cause being a physically unreliable area (can be caused by mishandling while in operation, say, temperature, shocks, or such). A soft error/failure refers to things failing for reasons not clearly physical, and possibly quite transient, and may for example happen when your power supply was iffy at the time of writing(verify).

Both hard and soft errors mean your data is likely gone, but only hard failures directly mean your hardware is probably scrap. The practical difference lies mostly in the reusability of the physical area. Put another way, it isn't yet sure whether this is only a read error or will also mean a write error.

Both hard and soft errors will likely persist while you're only asking to read the current data (Note that this includes a non-destructive read-write test, as it tries to read data first, and usually the entire problem is that it consistently can't).

Note that this means the term 'bad blocks' groups multiple issues: seeing if we can read the data (probably not, but it happens), and seeing whether we can reliably write data in the sector (and what exactly to do if not).


At the time of a sector read error, the drive will report that error (to the OS). The drive firmware does not remap the sector yet; it only lists it as unstable and pending to be remapped.

Remapping means the drive will back a logical sector with a spare sector. Drive firmware has the job of mapping block numbers to physical locations, and while it stores things linearly on-disk because that's good for seek time and therefore speed, it doesn't have to. It can and does keep an exception list, so that it can back a known-to-be-unreliable block with good spare capacity reserved purely for such reallocation (usually a few thousand sectors' worth)(verify).


Pending to be remapped means that it knows there is a problem, but it hasn't resolved it one way or the other - that will only happen when you attempt to write data to the sector, because the drive considers the case that this might turn out to be a soft/transient error. If it is soft, it is removed from the pending-to-be-remapped list and taken back into regular use. Soft errors will not be remapped, but will stay reported as pending blocks until you write new data to them.

If a write fails (repeatedly), the drive will probably decide it's a hard failure sector and will remap it if it can (probably the only reason it can't is if the spare sectors are all used up, in which case the block becomes permanently exposed to filesystems/users).
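
The pending/remap logic described above can be sketched as a toy model. This is only an illustration of these notes, not how any real firmware is written; the class, its names, and the spare count are all made up:

```python
class ToyDrive:
    """Toy model of pending and remapped sectors (not real firmware)."""

    def __init__(self, spare_sectors=4):
        self.spares_left = spare_sectors
        self.pending = set()    # failed a read, fate not yet decided
        self.remapped = set()   # now backed by a spare sector
        self.exposed = set()    # bad, and no spare was left to back it

    def read_failed(self, sector):
        # A read error only marks the sector as pending; nothing is remapped.
        self.pending.add(sector)

    def write(self, sector, write_succeeds):
        # A write is what settles the question for a pending sector.
        self.pending.discard(sector)
        if write_succeeds:
            return "ok"            # soft error: back in regular use
        if self.spares_left > 0:
            self.spares_left -= 1
            self.remapped.add(sector)
            return "remapped"      # hard error, backed by a spare
        self.exposed.add(sector)
        return "exposed"           # hard error, spares used up


drive = ToyDrive(spare_sectors=1)
for sector in (10, 11, 12):
    drive.read_failed(sector)
print(drive.write(10, True), drive.write(11, False), drive.write(12, False))
```

In SMART terms, `pending` corresponds roughly to the pending sector count and `remapped` to the reallocated sector count.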


This also means a destructive bad-block test (or low-level format) may fix transient errors that a read-only test and a non-destructive read-write test will trip over.

If read-write checks don't complain, there are either no errors or they all got fixed. If SMART tells you all pendings disappeared and there is no remap sector count, you're probably okay for now - although it's risky to assume the drive will last another couple of years unless there is a specific reason for the pendings.




diagnosis and recovery

(mostly linux-specific)


First, understand the summary above, and know that in many situations, you should give up on continuing to use the drive, and only care about recovering the data you can.

If you have good reason to believe that all errors are soft (e.g. you have just replaced a bad power supply), or that there is a specific reason only a few blocks are broken (maybe if you hit your laptop while it was reading/writing), you can be reasonably certain that the drive will last a while longer and not slowly fail and eat more data (though in the dropped-drive case it pays to be safe rather than sorry).


Trouble check

To test whether a drive is giving trouble:


Make the drive do SMART self-tests and see if the report in the SMART logs shows any errors. A smartctl -t short will take a minute or two and may quickly reveal that there is at least one error. For a drive-wide check you'll want smartctl -t long.
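
A small sketch of the two smartctl invocations involved: one to start the test, one to read the self-test log once it has finished. The helper name is made up and /dev/sdX is a placeholder for your device; the sketch only builds and prints the command lines, it does not run them:

```python
def selftest_commands(device, kind="short"):
    """Return the smartctl command lines to start a self-test and read its log."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return [
        ["smartctl", "-t", kind, device],        # start the self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log afterwards
    ]


# /dev/sdX is a placeholder, not a real device name.
for cmd in selftest_commands("/dev/sdX", "short"):
    print(" ".join(cmd))
```

Each list could be handed to subprocess.run() as-is; that usually needs root, and the log is only meaningful after the test has completed.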


Use badblocks somewhat like a long/offline test

  • read-only test is (somewhat) faster than a non-destructive read-write
  • (destructive read-write will destroy all data, and may fix soft errors)
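
The three badblocks modes map onto its options roughly as follows. A sketch that only builds the command line (the helper name and /dev/sdX are placeholders; -s shows progress, -v is verbose, -o writes the list of bad blocks to a file):

```python
BADBLOCKS_MODES = {
    "read-only": [],            # the default: only reads
    "non-destructive": ["-n"],  # read-write test that restores the original data
    "destructive": ["-w"],      # write test: overwrites everything on the device
}


def badblocks_command(device, mode="read-only", output_file=None):
    """Build a badblocks command line for the given test mode."""
    cmd = ["badblocks", "-s", "-v"] + BADBLOCKS_MODES[mode]
    if output_file:
        cmd += ["-o", output_file]
    return cmd + [device]


# /dev/sdX is a placeholder, not a real device name.
for mode in BADBLOCKS_MODES:
    print(mode + ":", " ".join(badblocks_command("/dev/sdX", mode)))
```

Do not run any of the write modes on a mounted filesystem, and remember that the destructive mode wipes the device.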

Getting data from the drive

Before you attempt any sort of recovery that involves attempted correction, writing, or such, you should probably back up your data.

If the errors are minor, you could probably just move/copy files to another drive.

badblocks lists the numbers of the blocks it had trouble with (counted from the start of the device? which can be a partition?), for example:

85066725
85066726
85066727

If it says nothing, you're probably fine.


(Note: mkfs and fsck tools may also be able to run badblocks or use its reports.)


On smartctl options

Output options:

-a / --all means verbose output (shows generic and vendor-specific details, and various stored logs). For ATA devices this is equivalent to: -H -i -c -A -l error -l selftest -l selective


-H: basic health check, OK-or-not -- mostly whether the drive has failed, or probably will within a day or so. Not the most reliable feedback, apparently partly because drive vendors abuse SMART thresholds and transforms so that drives may not alert even if they're completely dead.

-i: Prints model number, serial number, firmware version, ATA version, SMART support / enabled state, possibly drive model family, power mode.

-A: prints vendor specific attributes (that is, the various things it logs/counts like Raw_Read_Error_Rate, Spin_Up_Time, Power_On_Hours, and such, and then the set that the specific drive actually tracks, which varies between models, brands and such)


Test/log options

-l: show log. Probably most interestingly:

 -l selftest: most recent self-tests logged (executed via -t)
 -l error: most recent errors logged
 -l selective: some drives have (location?(verify)-)selective tests.


-t: start self-test. Probably most interestingly:

 -t offline: 
 -t short: (a few minutes)
 -t long: (dozens of minutes up to a few hours)
 -t conveyance: (a few minutes. ATA only.) 

Most tests are off-line (meaning they will happen transparently and without risk or bother) unless you also specify -C (captive), which will busy out the device, meaning you should not do this when you have anything mounted from the drive.


Drive options:

-d ata is often redundant

Use of -F may sometimes be useful, but it is often selected automatically, so it is also usually redundant.


Reading SMART reports

A few notes:

  • SMART vendor attributes are usually the most informative - but only a few are really interesting.
  • You should also know that the value/worst/threshold numbers -- those things in the 0..100 range -- are taken or calculated from the raw figure by the disk firmware, and they are sometimes abused to be almost completely useless.

In general, they are indicators at best and only the raw figure is directly meaningful.

  • the attribute numbering is fixed (but some attributes are vendor-specific) while the names can vary a little.
  • some disks only update some of their vendor attributes after, for example, a smartctl -t offline (which will take a while).



The attributes most representative of drive health are probably:

  • ID 5 (0x05), Reallocated Sector Count: number of sectors remapped by the firmware, as previously mentioned. A few is acceptable, but more than a few can indicate a deteriorating drive, and a slowly increasing figure likely does.
  • ID 198 (0xC6), Uncorrectable Sector Count: Seems to mean errors that could not be remapped(verify) (calculated/updated (only?) in offline scans)


Serious warning signs:

  • ID 1 (0x01), (Raw) Read Error Rate: Amount of times we had a problem while reading data from the drive (amount of times we had a problem, not amount of sectors we had it on). Not necessarily sector-related errors, not necessarily physical/permanent errors. Still, if you see these you want to look at its cause and the important indicators listed here.
  • ID 197 (0xC5), Current Pending Sector Count: Counts sectors that showed a read error, pending to be remapped. Not necessarily a problem - if they can be successfully read/written the next time they are soft/transient errors and disappear from this count. However, it's also possible they are real bad sectors that will become reallocated (counted in 5).
  • ID 196 (0xC4), Reallocation Event Count: counts the attempts at reallocation - both successful and failed ones - so is a more secondary indicator that means you should look at 5 and 198/C6.



Also bad indicators:

  • ID 10 (0x0A), Spin Retry Count: Retries necessary to spin up. If larger than zero, this can point to more general mechanical problems (or insufficient power for spinup at some point(verify))
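
As a rough illustration of how these notes could be turned into a check, here is a sketch that looks only at raw values, keyed by attribute ID. The function name and the example numbers are made up; ID 1 is left out on purpose because its raw value is not comparable between vendors:

```python
def assess(raw):
    """raw maps SMART attribute ID -> raw value. Returns a list of remarks."""
    checks = [
        (5, "reallocated sectors present; watch whether this keeps growing"),
        (197, "sectors pending remap; may clear once they are rewritten"),
        (198, "uncorrectable sectors reported"),
        (196, "reallocation events logged; look at 5 and 198"),
        (10, "spin-up retries; possible mechanical or power problem"),
    ]
    # Attributes missing from `raw` are treated as zero (not tracked by this drive).
    return [f"{attr_id}: {remark}" for attr_id, remark in checks if raw.get(attr_id, 0) > 0]


# Hypothetical raw values, not taken from a real drive.
for remark in assess({5: 3, 197: 1, 198: 0, 10: 0}):
    print(remark)
```

An empty result only means none of these counters is non-zero right now; it says nothing about how the drive will behave next month.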



ddrescue and such

Copy all readable data off a drive to somewhere else.

Note that read errors slow reading down to near-zero speed, and that copying an entire drive would take hours even without that. This will take a while.


There seem to be a few variations on this theme:

  • ddrescue (GNU)
  • dd_rescue (suse?)
  • myrescue
  • probably more
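
The core idea these tools share can be sketched in a few lines: read block by block, and when a block cannot be read, note it and write zeros so that everything after it keeps its offset. This is a much simplified illustration (no retries, no map file, no smaller-block second pass), not how ddrescue itself is implemented; all names are made up, and it assumes read_at either returns exactly the requested number of bytes or raises OSError:

```python
import io


def rescue_copy(read_at, total_size, out, block_size=4096):
    """Copy total_size bytes via read_at(offset, length), skipping unreadable blocks.

    Returns the (offset, length) ranges that could not be read.
    """
    bad_ranges = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            data = read_at(offset, length)
        except OSError:
            data = b"\x00" * length          # keep later data at the right offset
            bad_ranges.append((offset, length))
        out.write(data)
        offset += length
    return bad_ranges


# Simulated 16 KiB source whose second block is unreadable.
source = bytes(range(256)) * 64


def flaky_read(offset, length):
    if offset == 4096:
        raise OSError("simulated read error")
    return source[offset:offset + length]


image = io.BytesIO()
bad = rescue_copy(flaky_read, len(source), image)
print("unreadable ranges:", bad)
```

For a real device, read_at could be something like `lambda off, n: os.pread(fd, n, off)` on a file descriptor opened read-only, but for actual recovery you want one of the tools listed above.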


Unsorted

Test transfer speeds: Cached reads (-T) come from the Linux buffer cache without touching the disk, so they mostly show memory/CPU throughput (easily 1GB/s). Buffered disk reads (-t) go to the drive itself and show the disk/transfer speed (often somewhere between 30 and 120MB/s for platter drives, depending on the drive). The latter is useful in determining whether a high-speed mode that fits the drive's abilities is being used.

hdparm -tT /dev/sdb


Microsoft's NFI.exe utility can tell you which file is backed by a particular sector. This can be handy to see which files in a rescued image are damaged.


Drive wear and spindown



Unsorted