Computer data storage - RAID - aacraid notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


AAC-RAID (driver name aacraid) - *nix driver for Adaptec RAID controllers.


arcconf
is the command-line tool (replaces earlier ones, including afacli).

You should be able to do most management with it, though you'll probably want some documentation at your side.


Adaptec Storage Manager (ASM) is the GUI counterpart. It consists of two parts:

  • The agent, a (networked) background service that talks to the driver.
  • The manager, a GUI that talks to the agent, and can connect to remote agents, so allows more centralized management.

There are older versions floating around. More recent versions (circa 2012) for linux are around v7.31.18856

ASM can be a little finicky, but most of its quirks and workarounds are known well enough.


General notes:

  • PD refers to a physical device. A hard drive. You can hold it. Someone probably spent ages screwing them all in.
  • LD refers to a logical device (see also LUN) - the things that the array actually exposes to your OS.
This may be a single drive passed through as-is, or a RAID array/set made of many drives.
  • A hot spare can be assigned to one or more LDs, which seems to blend the local/global hot spare distinction. (verify)



Initial setup

Setting up PDs and LDs is easier via the card's BIOS utility.

Using arcconf has a bit more of a learning curve, though there are enough tutorials out there.


On scrubbing

The background consistency check seems not to be enabled by default. You can check whether it is enabled:

arcconf getconfig 1 AD


If you want it enabled: (and this is recommended if you like your data)

arcconf datascrub 1 on

...for the default 30 days, or set a different period, e.g. bi-weekly with:

arcconf datascrub 1 period 14


If you want more control over the time of execution: what it triggers is basically a:

arcconf task start 1 logicaldrive 0 verify_fix

The background task defaults to low priority (so won't affect performance too much, but will easily take a day), a verify_fix to high. If you want to change this, do something like:

arcconf setpriority 1 taskID high 

Where you can read off taskID from:

arcconf getstatus 1
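If you want to script against that, you can read the task IDs out of the GETSTATUS text. A minimal sketch, assuming a "Task ID : N" line format - the sample output below is illustrative, not copied from a real controller, so check what your arcconf version actually prints and adapt the pattern:

```shell
# Sketch: pull task IDs out of arcconf GETSTATUS-style output.
# The sample text is illustrative only -- real arcconf output formatting
# varies between versions, so adapt the pattern to what yours prints.

extract_task_ids() {
  # expects GETSTATUS-style text on stdin, prints one task ID per line
  awk -F': *' '/Task ID/ {print $2}'
}

sample_output='Logical device Task:
   Logical device                 : 0
   Task ID                        : 101
   Current operation              : Verify with fix
   Percentage complete            : 37'

printf '%s\n' "$sample_output" | extract_task_ids
```

Real usage would be piping `arcconf GETSTATUS 1` into the same awk.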

Inspection, and dealing with failed disks


Array/disk overview

arcconf GETCONFIG gets the basic state of arrays (LDs) and drives (PDs), useful to inspect general health.

arcconf GETCONFIG 1 


...which defaults to showing various sections:

  • AD: adapter (controller) info, including some overall settings, versions of firmware, driver.
  • LD: Logical device info (the composed drives that are exposed to the system)
    • if one drive fell out of the array, that may not be very clear until you count the 'segment information' lines
  • PD: Physical device info (the actual disks)

It may tell you an LD has failed or is degraded, but in many cases does not say why. For that you probably want arcconf GETLOGS (see below), and even then you may have to dig a bit.
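A sketch for spotting unhealthy LDs in that output without reading the whole listing - the "Status of logical device" label is an assumption from memory about what some arcconf versions print, so verify it against your own GETCONFIG output:

```shell
# Sketch: flag logical devices whose status is not Optimal, from
# GETCONFIG-style text.  The sample below is illustrative; the exact
# label text may differ between arcconf versions.

flag_degraded() {
  awk -F': *' '/Status of logical device/ && $2 != "Optimal" {print "not optimal:", $2}'
}

sample='Logical device number 0
   Logical device name            : data
   Status of logical device       : Degraded
Logical device number 1
   Logical device name            : sys
   Status of logical device       : Optimal'

printf '%s\n' "$sample" | flag_degraded
```

In real use you would pipe `arcconf GETCONFIG 1 LD` into the same function.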


Showing currently running tasks (rebuild, verify, etc.) and their progress:

arcconf GETSTATUS 1

Beeping

If you're next to a server that's beeping (and not in server-room headgear), you and your sanity may like to silence the current alarm (it'll beep again on the next one):

arcconf SETALARM 1 SILENCE


Disk replacement commands

General notes:

  • channel+device numbers are not the same as the slot number.
  • After you physically remove or add a disk, it is usually prudent to make the controller rescan (arcconf RESCAN 1) so that it notices the changes. (A rescan checks for removed drives and for any new drives. Expect it to pause the OS's IO for 10 to 20 seconds, and it may beep during most of that.)



  • Optional before you pull the old disk: verify that it is indeed broken, e.g. by checking arcconf logs for the reason, and perhaps checking its SMART state with smartctl
(In most cases it will only be kicked out for being broken, so this is somewhat optional)


  • Remove the disk from the logical array
    • If you want to kick out a drive yourself (for example because it's racking up bad sectors and you'd rather not take the chance that it corrupts something before it fails):
      arcconf setstate 1 DEVICE Channel# Device# ddd
    • If you waited for the controller to kick disks out, it's already gone
  • Remove the disk physically
you can make sure you'll be pulling the right drive by making its slot blink:
arcconf IDENTIFY 1 DEVICE channel device#
  • rescan


  • If you didn't have a hot spare in the array:
    • insert it physically
    • rescan
    • Optional: CLEAR any existing metadata (most RAID won't touch a disk with existing partitions, at least not automatically)(verify)
    • Optional: VERIFY the new disk's surface for errors. (TODO: is verify or verify_fix more sensible?)
      • Note that if it's going to be a hot spare, you have the time anyway
    • mark it as a hot spare to get it eligible for rebuilding -
      arcconf setstate 1 DEVICE Channel# Device# hsp
  • Get it rebuilding onto a new disk
  • if you have a disk marked as a hot spare, and it is not yet rebuilding onto it, trigger a rebuild (see below)
  • eventually remove the hot spare state status from the drive that is now functionally just an Online member (you can't do this while it's still rebuilding). This is optional functionally, but very handy for your overviews, so that you don't think you have more hot spares than you actually do.
arcconf setstate 1 DEVICE Channel# Device# rdy
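The no-hot-spare replacement steps above can be sketched as a command sequence. This is a dry-run sketch: it only prints the arcconf commands for review (it does not run them), and the controller/channel/device numbers are placeholders:

```shell
# Dry-run sketch of the no-hot-spare replacement sequence above.  It only
# *prints* the arcconf commands so you can review and paste them; the
# arguments (1 0 12) are placeholder controller/channel/device numbers.

replacement_plan() {
  ctrl=$1 ch=$2 dev=$3
  echo "arcconf RESCAN $ctrl"
  echo "arcconf task start $ctrl DEVICE $ch $dev clear      # optional: wipe old metadata"
  echo "arcconf task start $ctrl DEVICE $ch $dev verify_fix # optional: surface check"
  echo "arcconf SETSTATE $ctrl DEVICE $ch $dev HSP          # make it eligible for rebuild"
}

replacement_plan 1 0 12
```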

If it is not yet rebuilding

A rebuild is triggered by having a hot spare present, and an array that actively needs one.


If it is not yet rebuilding, things you can check include:

...is the failover necessary?

I'm not entirely sure how the controller's decision process works, but presumably it waits until you pull a Failed/Offline disk (verify) (e.g. to avoid cases where pulling the wrong disk causes a rebuild)


...is there a PD that is a Hot Spare?

Check the PD listing:

arcconf GETCONFIG 1 PD | less

If not, do a SETSTATE HSP on the new drive you added.

Also consider that hot spares may have been assigned to a specific LD, or not. I don't yet know how to inspect this(verify), but if in doubt you can SETSTATE RDY and SETSTATE HSP it again without mentioning a LOGICALDRIVE.


...is there another task?

For example, an ongoing scrub won't be canceled by a need for rebuild.


...is the PD the right size?

Different drive models sometimes have slightly different numbers of blocks. Being a tiny amount smaller can be the reason it can't be included into the array.


...is automatic failover enabled?

It typically is (since it's on by default), but it can't hurt to check - with:

arcconf getconfig 1 | grep -i failover

If it isn't, enable it with:

arcconf FAILOVER 1 on

States and tasks

PD states:

  • Ready - basically meaning "unused"
    • not a hot spare, initialized(verify)
    • ...and often specifically "not yet part of an LD"
    • Apparently, when a Hot Spare is kicked out for having bad sectors, it becomes Ready(verify)
  • Hot Spare
    • can be used when a drive is kicked out. May be local (assigned to be used by a specific LD) or global (usable by any LD)
  • Online - an active part of an LD
  • Failed - no longer active part of an LD (often because of medium errors, unreachability, or such(verify))
    • rebuild to a spare apparently does not start until you pull it. (When you pull it, it will fail over) (verify)
  • Offline - no longer active part of an LD


LD segment states - essentially the state of the data on the disks

  • Present
  • Rebuilding
  • Inconsistent (note: PD state may still be Online)
  • Missing (probably means kicked out because of timeout? Or physically pulled?)


LD states:

  • Optimal - things couldn't be better
  • Sub-optimal (a.k.a. "Suboptimal, Fault Tolerant"?) - we're working, and not degraded, but there's a drive missing (e.g. 1 disk gone from a RAID6 segment - not degraded, not optimal)
  • Degraded - slower because of missing drives, and can probably not stand another drive failure
  • Failed
  • Rebuild(verify)
  • Impact - created, but initial build did not complete (verify). You'll want a verify_fix.



On SETSTATE
arcconf SETSTATE 1 DEVICE Channel# device# State [LOGICALDRIVE LD# [LD#...]]

...sets a PD's state to one of:

  • HSP - make it a (global?) hot spare
  • RDY - remove spare designation (from either a standby (Ready) or an active (Online) hot spare)
  • DDD - force it offline
Offline             --task start initialize ---makes it--> Ready
Ready               --SETSTATE hsp          ---makes it--> Hot Spare(+Ready)
Hot Spare(+Ready)   --SETSTATE rdy          ---makes it--> Ready
Hot Spare(+Online)  --SETSTATE rdy          ---makes it--> Online   
Ready or Hot Spare  --SETSTATE ddd          ---makes it--> Failed (or Offline?)(verify)

When clearing hot-spare state with RDY, LOGICALDRIVE part should match how you made it a HSP -- globally or locally. If you get "The specified spare is NOT of type: DHS", then you probably made it a global hot spare, and are now mentioning an LD. (verify)

Tasks

Showing currently running tasks (rebuild, verify, etc.) and their progress:

arcconf getstatus 1


Tasks on PDs:

arcconf task start 1 DEVICE Channel# Device# verify_fix

Where the last is a task, one of:

  • verify - verify disk media (for bad sectors). There's rarely a reason not to use verify_fix instead(verify)
  • verify_fix - verify disk media, try to fix
  • initialize - removes LD metadata, which restores it to Ready state
  • clear - removes all data, including metadata
  • secureerase


Tasks on LDs:

arcconf task start 1 LOGICALDRIVE LD# verify_fix

Where the last is a task, one of:

  • verify_fix
  • verify
  • clear (but you may not ever want to do that...)


You will see current tasks via

arcconf GETSTATUS 1



GETLOGS

When you're digging for reasons:

  • arcconf GETLOGS 1 DEVICE | DEAD | EVENT [CLEAR | TABULAR]
    • DEAD basically lists the time at which a drive failed to be part of the LD. It gives date+time of event, drive SN, failure reason
    • failure reason codes seem to be:
      • 0 - Unknown failure
      • 1 - Device not ready (meaning?)
      • 2 - Selection timeout (meaning?)
      • 3 - User marked the drive dead
      • 4 - Hardware error (meaning?)
      • 5 - Bad block
      • 6 - Retries failed (meaning?)
      • 7 - No Response from drive during discovery
      • 8 - Inquiry failed (meaning?)
      • 9 - Probe (Test Unit Ready/Start Stop Unit) failed
      • A - Bus discovery failed
    • DEVICE reports some per-device problems, including:
      • numParityErrors - not necessarily very serious(verify)
      • linkFailures - not necessarily very serious(verify)
      • hwErrors - serious?
      • abortedCmds - not unusual?
      • mediumErrors - indicates failing drive?
      • smartWarning - (not always reliable warnings?)
    • EVENT lists recent controller events
  • arcconf GETLOGS 1 UART
    - fetches some fairly raw logs from the controller. Only recent stuff.
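The DEAD-log failure reason codes above can be turned into messages with a trivial lookup helper, e.g.:

```shell
# Lookup helper for the DEAD-log failure reason codes listed above.

failure_reason() {
  case "$1" in
    0) echo "Unknown failure" ;;
    1) echo "Device not ready" ;;
    2) echo "Selection timeout" ;;
    3) echo "User marked the drive dead" ;;
    4) echo "Hardware error" ;;
    5) echo "Bad block" ;;
    6) echo "Retries failed" ;;
    7) echo "No response from drive during discovery" ;;
    8) echo "Inquiry failed" ;;
    9) echo "Probe (Test Unit Ready/Start Stop Unit) failed" ;;
    A|a) echo "Bus discovery failed" ;;
    *) echo "Unrecognized code: $1" ;;
  esac
}

failure_reason 2   # the code the drive-firmware pauses mentioned below tend to show as
```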



Unsorted

Issues

Disks falling out of the array

There are a number of reasons the controller can throw a disk out of the array. A dying disk is only one of them.


Any drive that takes very long to respond will be kicked out. (Spinning up an idle drive should not cause this.)


Reasons include:

  • removing a drive without telling the controller(verify) (selection error)
  • bad sectors on a desktop-class drive -- because the drive will stall while trying to read/fix, and it may do so for minutes or longer (also depending on readahead)
  • Drive firmware behaviour/bugs. For example, I ran into a problem on desktop-class Seagate 7200.14-series drives (3TB barracuda) that would show a bunch of aborted commands and link failures (and count mostly as selection failures). Updating from firmware CC4C to CC4H fixed this problem.


SCSI bus appears hung


WARNING: from the internet. Not all checked.

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?
aacraid: SCSI bus appears hung


The message comes from the aacraid driver (specifically drivers/scsi/aacraid/linit.c, function aac_eh_reset(), when the linux SCSI subsystem has asked it to reset).

This message means the kernel has noticed the controller isn't responding very fast (based on what exactly? The fact that commands stay in the queue for a long time?) and asks it to reset.

The reset causes the controller to block while it flushes its outstanding IO, after which it resets and continues. In itself, this amounts to hard-handed flow control. (The controller considers it more important to handle data it already has than to accept new data from the kernel, which means that particularly heavy loads (e.g. an overload of random seeking) can potentially always (help) trigger this behaviour)


If there are no errors (...below this informational message), then it's really just flow control. Your data is fine.


If there are errors below the message, then something more serious is probably happening - look up said error.


There are ways to tweak how likely this warning is, depending on what exactly causes it. For example, changing the controller-side queue depth, and enabling the writeback cache (insert generic battery-back warning here), can both help - but presumably only when it's very hard loads that cause it.



timeout problems

The aacraid controller's internal timeout/recovery cycle is apparently 35 seconds.

The linux SCSI subsystem timeout used to be 30 seconds. If it still is, then linux may ask for a reset whenever the controller's EH (error handling) takes a while. In such cases, the reset is probably just redundant, and causes unnecessary choppiness.

That 30-second linux-SCSI timeout was changed to a 45-second default (apparently a few years ago?), which means this ought not to be your problem. Check what /sys/block/yourdevice/device/timeout contains. You can set the value there too (set more permanently via inittab?)
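A small sketch for checking that timeout across all block devices - the sysfs root is parameterized here only so the logic can be tried without real hardware; in real use you would run it against /sys/block:

```shell
# Sketch: report the SCSI command timeout for every block device, and
# suggest a fix for any below a minimum.  SYSROOT defaults to /sys/block
# but is parameterized so the logic can be exercised against a mock tree.

check_timeouts() {
  sysroot=${1:-/sys/block}
  min=${2:-45}
  for f in "$sysroot"/*/device/timeout; do
    [ -e "$f" ] || continue
    t=$(cat "$f")
    dev=${f#"$sysroot"/}; dev=${dev%%/*}
    if [ "$t" -lt "$min" ]; then
      echo "$dev: $t (below $min, consider: echo $min > $f)"
    else
      echo "$dev: $t ok"
    fi
  done
}
```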


Keep in mind that in a healthy array of healthy disks, none of these timeouts ought to be relevant - most regular IO is served within a bunch of milliseconds under light or sequential load, up to perhaps a few hundred ms or a second or two under load. If something hangs around for more than half a minute, that's unusual at best, and more likely a sign of a more serious problem - making the reset message an indirect warning of another real problem. Such as:


Consumer-class error handling

If you use consumer-class drives, then any drive-level problem will first be noticed by the drive firmware itself and trigger its own error handling, which may block the controller's command for considerable time.

Long enough for the controller to want to reset it, and sometimes long enough for it to consider it failed.

It's not that it's bad hardware, it's just not a well-tuned combination.



firmware/driver bugs

There are a few known problems that firmware/driver updates may fix (and probably a few unknown ones).

For example, an older version of aacraid (which? how to diagnose as this case?) had a bug related to hyperthreading.

Updating the firmware/bios (to 5.2.0 Build 18668 or later?(verify)), and the driver (to 1.1.7-28700 or later?(verify)) will fix that one (disabling hyperthreading was a workaround).

There are some known drive firmware bugs.


RAID non-recovery

If you see:

[2740390.344436] sd 4:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[2740390.344439] sd 4:0:1:0: [sdb] Sense Key : Hardware Error [current]
[2740390.344442] sd 4:0:1:0: [sdb] Add. Sense: Internal target failure
[2740390.344447] sd 4:0:1:0: [sdb] CDB: Read(10): 28 00 33 dd dc 00 00 00 08 00
[2740390.344454] end_request: I/O error, dev sdb, sector 870177792

This is the adaptec card/driver responding with an error - basically because the applied raid level cannot recover from lower-level misbehaviour (be it drive read errors, or complete drive dropout).

Sense Key is the basic error type. See e.g. the 'SCSI Sense Keys' section on this page. In this case, Hardware Error "Indicates that the target detected a non-recoverable hardware failure (for example, controller failure, device failure, parity error, etc.) while performing the command or during a self test."

Add. Sense can give additional information that a driver can report. I don't think this particular error is very informative(verify)

CDB is a SCSI Command Descriptor Block, which tells you what command failed.


This case may indicate a disk failure (how to check this?)


Hardware incompatibility

Hardware incompatibility is possible, although this is usually about firmware quirks that make things interact badly.

The producer/supplier of the computer, controller, enclosure, or drives may have some advice (or may say nothing more useful than that the combination is not certified, which usually just means they have not tested that specific combination of hardware in their labs)

It is possible to have incompatibility in the interaction between controller, enclosure, and drives (any combination of those).

Even if the drive is enterprise class and expensive, there is no hard guarantee (short of certification).

Diagnosis: hard, unless you find either a note of certification, or that a firmware update fixes a problem.



Other theories

  • The RAID enclosure is often an I2C device. Interaction of IPMI monitoring and lm-sensors drivers can break things when both talk at the same time (because both the BMC and your southbridge can be an I2C master on the same bus).
  • A flaky power supply, which cannot deliver all the power the drives ask for, can cause some unpredictable behaviour, including selection errors

SMART monitoring

aacraid exposes /dev/sg* devices, mostly for monitoring temperature and SMART.

Some of these devices are things like enclosures rather than drives.

Running sg_scan -i should reveal what everything is.


You'll probably want to use the -d sat argument to smartctl.

A quick and dirty report that may be useful when checking for failed disks:

for dev in /dev/sg*; do
  echo '------------'
  echo "$dev"
  smartctl -d sat -a "$dev" | egrep -i '(realloc|pend|uncorr)'
done
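If you want a report that only shouts when something is actually nonzero, a sketch - the sample lines mimic smartctl -A's attribute table, but column layout can differ per smartctl version and drive, so treat the field positions as assumptions:

```shell
# Sketch: flag nonzero raw values for the attributes the loop above greps
# for, given smartctl -A style attribute lines.  The sample is illustrative;
# real column layout can vary by smartctl version and drive model.

flag_bad_attrs() {
  # attribute name is field 2, raw value is the last field
  awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
    if ($NF + 0 > 0) print $2, "=", $NF
  }'
}

sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0'

printf '%s\n' "$sample" | flag_bad_attrs
```

Real usage would be `smartctl -d sat -A /dev/sg2 | flag_bad_attrs`.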



The tweaks for munin (note there seem to be other plugins you can use, and that you may prefer):

  • hddtemp_smartctl - edit the plugin configuration file both to mention all those devices, and to mention each as type sat (under the right section), e.g.
env.drives sg0 sg1 sg2
env.type_sg0 sat
env.type_sg1 sat
env.type_sg2 sat
  • smart_ - Link all the devices (so that you'll have smart_sg2, smart_sg3, etc), and edit the configuration file (plugin-conf.d/munin-node), adding something like the following under [smart_] / [smart_*]:
env.smartargs -d sat -a

Observations from some problems



Seagate Barracuda 7200.14 series (1TB-platter platform, in our case 3TB) drives with CC4C firmware have a bug where they would sometimes pause their responses for more than half a minute, which makes the RAID controller consider them unresponsive (complaining about abortedCommands and linkFailures) and kick them out of the array (the DEAD log usually shows failureReasonCode:2, selection timeout). An update to firmware CC4H seems to have fixed this pause.



At one point I had the controller not recognize fifteen slots in an enclosure - and when it first occurred, it took down the system, because that included the system drive. The cause turned out to be a single drive that had failed so thoroughly (it makes bad noises when powered on) that it confused the SATA bus seriously enough for the controller to ignore that drive and all after it. (In a test computer it blocked the BIOS tests at 5A for a while, and gave a lot of problems at every attempted access.)


Note: These drives would be marked failed in metadata and could not be re-integrated (via HSP state) until their metadata was wiped with something like:

arcconf task start 1 device 0 22 initialize

Semi-sorted

See also

Partitioning#Huge_filesystems_-_practicalities


TODO: read

http://docs.oracle.com/cd/E19121-01/sf.x4150/820-2145-12/chapter2.html

http://www.visordown.com/forum/general/adaptec-storage-manager/183956.html#ixzz20VxI2j7Q

http://hwraid.le-vert.net/wiki/Adaptec

http://hwraid.le-vert.net/wiki/Adaptec#Ifyoucantconnectaremoteagent

http://wiki.debian.org/LinuxRaidForAdmins