Linux admin notes - health and statistics



📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

Reading (linux) system use and health

Load average (top, uptime, etc.)

Load average is three numbers, looking like 4.14, 7.52, 8.75, which indicate how heavily processes wanted CPU over the last one, five, and fifteen minutes.

Load average is reported by top and other overview tools, by uptime, and more. If you want to read just those values yourself, see /proc/loadavg

Technical interpretation

The number is based on a count of active processes, where 'active' means was running on CPU, or was scheduled to be on CPU, or is in uninterruptable sleep (often disk, sometimes network or other), so roughly "really wants to be doing something now".

💤 Technically this is about scheduling more than use, but in practice the two tend to be well correlated.
The 'average' in load average is technically not accurate - it's more of an exponentially damped figure, which works similarly but is better at showing trends and ignoring singular peaks.
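
A minimal sketch of that damping, assuming the usual scheme where the kernel samples the number of active tasks every 5 seconds and folds each sample in with a decay factor of exp(-5/window). The function name simulate_load is made up for this example, and the real kernel code uses fixed-point arithmetic rather than awk floats.

```shell
# simulate_load WINDOW_SECONDS
# Reads one "number of active tasks" sample per line from stdin
# (assumed to be taken 5 seconds apart) and prints the damped load
# after each sample. Illustrative only, not the kernel's exact math.
simulate_load() {
  awk -v window="$1" '
    BEGIN { decay = exp(-5 / window); load = 0 }
    { load = load * decay + $1 * (1 - decay); printf "%.2f\n", load }
  '
}

# Example: feed it a steady 4 active tasks and watch the one-minute
# figure creep up towards 4:
#   awk 'BEGIN { for (i = 0; i < 40; i++) print 4 }' | simulate_load 60
```

With a 60-second window, a single sample only contributes about 8% of its value, which is why one short spike barely moves the figure while a sustained load does.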

More intuitive interpretation

When you see 4.14, 7.52, 8.75 you can guess

there are probably four processes active now
and closer to eight or nine over the last five to fifteen minutes.

If this number hangs around the number of CPU cores you have, the machine is keeping itself busy nicely enough.

Higher than that -- well. tl;dr: you cannot easily tell from this number alone whether that is 'happily fully busy' or 'maybe start considering this overloaded'.

may just mean there happens to be a lot of work right now and we're all sharing the CPU, which you probably wouldn't call overloaded
may be a direct result of thrashing due to overcommitted RAM --- because most of the scheduled tasks can't continue until data is swapped in from disk

...for which you want to see the amount of swapping happening (e.g. via vmstat), or check whether many processes are in D state
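
As a rough aid for the comparison against core count, here is a sketch that divides the three figures by the number of cores. The function name load_per_core is made up for this example, and nproc is assumed to be available (it is part of GNU coreutils).

```shell
# load_per_core LOADAVG_LINE NCORES
# Takes the contents of /proc/loadavg and a core count, and prints the
# 1, 5 and 15 minute load divided by the number of cores.
load_per_core() {
  echo "$1" | awk -v cores="$2" \
    '{ printf "%.2f %.2f %.2f\n", $1 / cores, $2 / cores, $3 / cores }'
}

# Typical use:
#   load_per_core "$(cat /proc/loadavg)" "$(nproc)"
```

Values around 1 then mean "about one active task per core". As noted above, values well over 1 still do not tell you on their own whether that is CPU contention or IO wait.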

Process states, CPU use types

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

At low level, there are relatively few states[1], but in practice you also care about what the scheduler thinks.

The combination reports roughly the state letters you'll see in ps and top.

  • R - running
  • S - sleeping (interruptible)
  • I - noload idle (distinction from S introduced later - apparently S often 'waiting for event') (verify) (previously N?(verify))
  • D - disk sleep (not interruptible)
  • T - stopped (paused by job control)
  • t - tracing stop (paused for process tracing reasons)
  • X - dead (though apparently you shouldn't ever see this?)
  • Z - zombie (user code has exited, process is sitting around waiting for kernel to remove the process)

The scheduler has changed over time, which is why some of these are new, like I and X (and some are considered internal extended guts not as easily gotten at(verify)).

And some that were introduced, then removed again

  • x - dead (what difference?)
  • K - wakekill "wake on signals that are deadly"
  • W - waking (not reported anymore?, internal intermediate you would never really see(verify))
(did it mean paging in a process'?)
  • P - Parked (not reported anymore?, internal intermediate you would never really see(verify))
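
To get a quick feel for which of these states are actually in use on a machine, a small sketch that tallies the first letter of ps's STAT column. The helper name count_states is made up; it assumes a ps that supports -eo stat= (procps does).

```shell
# count_states
# Reads process state strings (as printed by `ps -eo stat=`) from stdin
# and prints how many processes are in each main state. Only the first
# letter is used, since ps appends modifiers such as + or s after it.
count_states() {
  awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use:
#   ps -eo stat= | count_states
```

On a mostly idle system you would expect almost everything under S and I, with one R for ps itself.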

More on (un)interruptable sleep and wait time - For context

The difference between interruptible and uninterruptible sleep seems to be an implication of the semantics / documented guarantees of IO syscalls (and couldn't be changed without breaking a lot of software).

  • uninterruptable sleep
usually means 'the process and kernel are waiting for an event that, for correctness reasons, should not be interrupted by signals'
in practice often syscalls, excluding syscalls that are fulfilled immediately, and those that can be restarted after being interrupted
...often means "blocking reads from disk or other IO"
  • interruptable sleep - not doing anything, or waiting for an event before becoming runnable again
can be voluntary (e.g. sleep()), can be forced by the kernel
kernel can force signals, which would be handled
(note there is a distinct 'ready and in memory' and 'ready, but in swap')

So S state (interruptible sleep) is used for 'pause until resource becomes available before we do a thing' (or 'I have nothing to do'), yet would still become active if we send it signals.

and D state (uninterruptible sleep) differs in that it won't handle incoming signals, and usually means the process did a syscall of the sort where interrupting it may be a bad idea, so the kernel decides it won't schedule this process again before that syscall is done.

In practice, uninterruptable sleep is avoidable for almost all things - other than some IO.

As such, uninterruptable sleep is very commonly specifically IO wait time.

...and then often specifically disk (other IO tends to be done quickly), and then frequently platter disk, mostly for the practical reason that disk is an order of magnitude or three slower than memory, particularly for random access, making it the typical thing that would be waited on (when moving a lot of data, or accessing a lot of different locations).

Note that various IO calls only sometimes block. For example, disk writes and network sends are often buffered, so for small amounts the syscall just accepts the data in the buffer and returns (to later do its own thing), and only when that buffer is considered full would that syscall block.

Note that for sending large blobs of data, that's what you'd expect, and this waiting works out as a crude form of backpressure-style flow control.

There are some further cases, like

  • mmapped IO
  • network mounts if their file interface needs it.
  • accessing memory that was swapped out
  • swapping to the point of thrashing (will apply to many active processes)
  • a disk with a bad sector trying to correct it before it responds (if paranoid, look at SMART reports)

Digging deeper

To figure out which disk(s), try (if you don't have iostat, it's usually in a package called sysstat):

 iostat -x 2

Even without understanding all those columns, it's usually easy enough to read off which disk(s) is going crazy and which are idle.

Some notes:

  • the reads and writes/second are basically the IOPS figure, which you can compare against expectation of your disk (or array). It's one decent metric of disk utilisation.
  • some things cause IOwait without involving a lot of data - e.g. a lot of seeking caused by fstat()ing a large directory tree.
This is possibly more visible in the await column, which shows the average time (milliseconds) that disk operations stay queued (waiting, seeking, reading/writing and everything around it)
particularly on platter: if it's at or below the drive's seek time then the drive probably does one thing at a time (=is keeping up with the requests). If it's usually more than the drive's seek time, then things are probably regularly waiting on seeks.
Platter drives are often around ~7ms, so a dozen milliseconds is still fine, while a hundred is starting to be a lot.

Past observing that you have IO wait time, you may want to find out what process and what device is so busy.

A crude start is to do something like:

 while true; do sleep .3; ps -lyfe | grep '^D'; done

This still leaves some guesswork, though. Your cause will be one of the processes, and the rest are being held up by it. There is not really a fundamental difference.

Also you'll almost always see ignorable things like

pdflush (kernel process which buffers and flushes data to disk) - in fact, until there's contention, this will be in D more than the process where the data comes from. Offloading iowait is sort of the point of pdflush.
kworker (kernel process that handles interrupts, timers, IO)
filesystem-supporting processes like jbd2 (for ext[34]), txg_sync and spl_dynamic_tas for ZFS, and so on
smartd (shouldn't actually block the disk)
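
The loop above can be extended to also show what each blocked process is sleeping in, which takes away some of the guesswork. A sketch, assuming procps ps: its wchan column names the kernel function a process is sleeping in (it may show just - without enough privileges, or on some kernels), and the :32 suffix is procps syntax for widening a column. filter_dstate is a made-up helper name.

```shell
# filter_dstate
# Reads `ps --no-headers -eo state,pid,user,wchan:32,comm` output from
# stdin and keeps only the lines whose state column is exactly D
# (uninterruptible sleep).
filter_dstate() {
  awk '$1 == "D"'
}

# Typical use, sampling a few times a second:
#   while true; do
#     ps --no-headers -eo state,pid,user,wchan:32,comm | filter_dstate
#     sleep 0.3
#   done
```

Using the single-letter state column (rather than stat) avoids having to deal with modifiers like D+ or D<.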

You can try to figure out the kind of work a process is doing so hard, e.g. by using strace (probably -c for time summary) to figure out which syscalls it's spending most time in.

There are helper scripts for this, like

You may also find perf useful - see e.g. perf top.

Memory use

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

IO and filesystem usage

🛈 Keep in mind you may need to run these as superuser to be particularly informative

Specific files


fuser lists PIDs that have a particular resource open (reading, writing, executing, mmapped; these and more are appended as a letter), e.g.

  • fuser -v ~ - "Which processes have this directory open", e.g. your homedir (note that it doesn't check for things under it. Your homedir happens to be somewhat special because shells tend to start there)
  • find /tmp - "check for a whole bunch of specific files", e.g. stuff that processes have open in /tmp
  • fuser -m -v /mnt/data4 - "figure out which filesystem this path is on, then list all processes which have files open on that filesystem". Useful to see what prevents a umount, or drive spindown
  • fuser -v -n tcp 80 - "who has TCP port 80, and in a human-readable form, please"


lsof lists open files. Because of unix's "everything is files" philosophy, this includes sockets, directories, FIFOs, memory-mapped files, and more.

  • lsof /data4 - if you give it a mountpoint, it should list the files open on that filesystem (verify)
    • watch -n 0.1 "lsof -n -- /data /data2 | grep smbd | grep -E -i '\b(DIR|REG)\b'": Assuming those are mount points, "keep tracking the files and directories that samba keeps open on these filesystems"
  • lsof -u samba lists open files for user samba (something fuser cannot do)
  • lsof -c bash lists everything related to running bash processes
  • lsof -p 18817 lists all things opened by a certain process
  • lsof -i -n "Alright, what's networking up to?"
Netstat is probably more interesting for this, but looking by port (lsof -i :22) and by host (lsof -i@ followed by the host) is easy enough (see the man page for more details).
  • or just a summary of which programs use how many handles: lsof | cut -f 1 -d ' ' | sort | uniq -c | sort -n

(Note that different *nix-style systems have different options on lsof)
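
The lsof summary above counts by program name. A sketch that gets similar per-process counts straight from /proc instead, assuming the Linux /proc layout (/proc/PID/fd and /proc/PID/comm). The function name fd_counts and its optional root argument are made up for this example. Without root you will only see counts for your own processes.

```shell
# fd_counts [PROC_ROOT]
# For every numeric directory under PROC_ROOT (default /proc), prints
# "<open fd count> <pid> <command name>", sorted by count.
fd_counts() {
  root="${1:-/proc}"
  for dir in "$root"/[0-9]*; do
    [ -d "$dir/fd" ] || continue
    # unreadable fd directories (other users' processes) count as 0
    count=$(ls "$dir/fd" 2>/dev/null | wc -l | tr -d ' ')
    name=$(cat "$dir/comm" 2>/dev/null)
    echo "$count ${dir##*/} $name"
  done | sort -n
}

# Typical use:
#   fd_counts | tail -n 20
```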

IO summaries

vmstat gives a summary about processes, memory, swapping, block IO, interrupts, context switches, CPU and more. Good to inspect how a taxed system is being taxed.

For example, vmstat 2 shows averages every two seconds. It can also show certain fine-grained statistics, given kernel support (see the man page).

Output looks something like:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  98304 9078160 1073804 6927344    0    0    34    59    6   14  3  1 96  0  0
 1  0  99072 9200024 1066264 6875312    0  378   192   474 2130 7133  4  6 89  0  0
 1  0  99584 9259084 1061024 6859368    0  234    64   426 2420 7606  4  7 89  0  0
 1  0 100096 9361264 1054180 6805064    0  364   432   546 2456 7879  5  7 89  0  0
 1  0 100608 9444436 1052964 6774128    0  250   128   432 2255 6804  4  6 90  0  0
 1  0 101120 9529820 1050992 6742676    0  272    70   606 2528 7703  5  7 88  0  0
 2  0 101120 9527324 1050996 6742632    0    0   142   160 2250 5922  4  6 89  0  0
 0  0 101120 9530388 1050996 6742764    0    0   128    72 2042 5344  4  5 90  0  0
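
When the question is specifically "is this machine swapping right now", the si and so columns are the ones to watch. A sketch that picks them out by header name rather than by position, since column layout can differ between vmstat versions. The helper name swap_activity is made up.

```shell
# swap_activity
# Reads `vmstat N` output from stdin, finds the si and so columns from
# the header line, and prints a line for every sample where either
# swap-in or swap-out is non-zero.
swap_activity() {
  awk '
    $1 == "r" {
      for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
      next
    }
    si && so && ($si + 0 > 0 || $so + 0 > 0) { print "swap in/out:", $si, $so }
  '
}

# Typical use:
#   vmstat 2 | swap_activity
```

In the sample output above this would report the five samples with a non-zero so, i.e. the machine was swapping out for a while and then stopped.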

Things like iotop, and features within atop and htop can be used to show IO speed of processes, and/or totals.

On iotop:

  • needs to be run as root (or have NET_ADMIN capability), which can be impractical
  • iotop -o shows just processes with non-zero IO
  • iotop -a shows cumulative amounts
the combination of -o and -a can more clearly show the more active tasks
  • OSError: Netlink error: Invalid argument (22) basically means your kernel doesn't have the support(verify). If you're on centos or rhel, this means 5.6 or later.(verify)


  • ifconfig to see (or configure) the network interfaces
  • netstat will list various things about networking and can show e.g.
    • open connections (no parameter)
    • listens and open connections (-a)
    • udp and/or tcp (-u, -t) since you often don't care about all the unix sockets
    • routing table (-r) (see also route)
    • interface summary (-i)
    • statistics (-s)

I use -pnaut (programname, noresolve, listen+connections, udp, tcp).

  • ss is similar to netstat

  • arp (arp -n to avoid resolves) to see the ARP table
  • route (route -n to avoid resolves) to see the routing table
  • iptables to change the IP filtering/nat/mangling tables (see also iptables). Possibly interesting to you are:
    • iptables-save, which produces file-saveable text (and is also handy to see all of the iptables state), and
    • iptables-restore, which reinstates a file saved through iptables-save.

  • iwconfig to see (or configure) the wireless network interfaces
    • (Other general wireless tools: iwevent, iwspy, iwlist, iwpriv)
    • (Other specific wireless tools: wlanconfig, etc.)

Kernel, drivers

  • lsmod lists currently loaded kernel modules (see also modprobe, insmod, rmmod)
  • lspci lists PCI devices. Using -v is a little more informative. (see also setpci)
  • lsusb lists USB busses and devices on them

Drives and space

df tells you what storage you can get at, and how much space is left on each.

In contrast:

  • /etc/mtab lists things that are mounted, more completely than df does, because df reports only things meant for storage, which excludes things like proc, udev/devfs, usbfs, and whatnot.
  • To see an exhaustive list of things that the system knows could be mounted, see /etc/fstab (see also fstab).
  • To see swap partition/file use, cat /proc/swaps will do, which is basically what swapon -s does.
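
A small sketch that turns /proc/swaps into a percentage-used figure per swap area. It assumes the usual column order (Filename, Type, Size, Used, Priority, with sizes in KiB). The function name swap_usage is made up for this example.

```shell
# swap_usage [FILE]
# Reads a file in /proc/swaps format (default /proc/swaps) and prints,
# per swap area, its name and the percentage in use (rounded down).
swap_usage() {
  awk 'NR > 1 && $3 > 0 { printf "%s %d%%\n", $1, 100 * $4 / $3 }' "${1:-/proc/swaps}"
}

# Typical use:
#   swap_usage
```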

df notes:

  • The -h option is useful to see human-readable sizes.
  • df -B MiB (or MB) makes df report everything in megabytes, which can be useful when you're watching for differences on the order of megabytes per second (e.g. watch -d df -B MiB)
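  • The available space df reports can be lower than size minus used, because ext-style filesystems reserve some space for root (5% by default).
short story:
this is a good thing for general-use and particularly system disks
though in WORM situations it can make sense to set it to 0%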

To see where the big stuff is in the directory tree, use du, detailed elsewhere.

If you want an easier-to-interact-with tool, look at ncdu

There is a better, graphical overview from things like baobab, filelight or kdirstat.

RAM health

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.



EDAC ("Error Detection and Correction") reports (among other things) errors in ECC RAM, and also PCI bus transfers.

If you see errors in your logs, the main thing to check is whether they mention CE (Corrected Errors) or UE (Uncorrectable Errors)

An occasional CE is to be expected (that's not bad ECC, that's just you noticing the occasional bit flips that regular-RAM people don't notice).
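
Besides the logs, the running CE and UE totals are usually also exposed in sysfs. A sketch, assuming an EDAC driver is loaded and uses the common layout of one mcN directory per memory controller under /sys/devices/system/edac/mc, each with ce_count and ue_count files. The function name edac_counts and its optional root argument are made up for this example.

```shell
# edac_counts [EDAC_ROOT]
# Prints "<controller> CE=<n> UE=<n>" for every memory controller
# directory (mc0, mc1, ...) found under EDAC_ROOT.
edac_counts() {
  root="${1:-/sys/devices/system/edac/mc}"
  for mc in "$root"/mc[0-9]*; do
    [ -r "$mc/ce_count" ] || continue   # also skips the unexpanded glob
    echo "${mc##*/} CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
  done
}

# Typical use:
#   edac_counts
```

No output at all most likely means no EDAC driver is loaded for your memory controller, not that there are no errors.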


See also:!topic/fa.linux.kernel/3RV3y4Y2WT8%5B1-25-false%5D

CMCI storms

If you saw your log mention something like:

Sep  3 07:31:36 hostname kernel: [575877.525766] CMCI storm detected: switching to poll mode
Sep  3 07:31:37 hostname kernel: [575878.258800] EDAC MC0: 1900 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:38 hostname kernel: [575879.259555] EDAC MC0: 1610 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:39 hostname kernel: [575880.260323] EDAC MC0: 2560 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:32:06 hostname kernel: [575907.545221] CMCI storm subsided: switching to interrupt mode

For context: CMCI itself is Corrected Machine Check Interrupt, Intel's way of communicating EDAC messages.

CMCI messages are usually communicated using interrupts, to deliver them as soon as possible, which is nice for diagnosis because they get logged sooner rather than later. Possibly even before a kernel panic.

However, when there are continuous reports (that are not crashy yet), this would lead to a consistently high rate of hardware interrupts, which ends up taking away unnecessarily much of a CPU's time -- an interrupt storm. To avoid the associated slowdown, the kernel code switches CMCI communication to polling instead (fetching chunks at a time, and at the cost of them arriving a little later), until the report rate is low again. (see log fragment above)

So the CMCI storm itself isn't the problem, but the cause behind it usually is.

Say, when it is a sign of a problematic or broken memory stick.

Or sometimes bad BIOS-level config of RAM.

If the last thing in your logs before a freeze/panic is a CMCI storm message, that's probably due to hitting an uncorrectable memory error soon after. The OS logs may not mention it because the kernel chose to panic before that could be written.

If your BIOS keeps recent EDAC messages, that will be a more reliable source.

If you do suspect memory issues, consider running memtest86 or similar.

And/or: on the next bootup, watch a terminal doing dmesg -w and try to reproduce, maybe by running a user-space memtester (not as thorough - you can only check RAM that isn't allocated yet). If a CMCI storm message is the last thing you see in the logs before a freeze or panic, your RAM is suspect.
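
To get a rough idea of how often this happens, a sketch that summarizes a kernel log. It assumes messages shaped like the log fragment above ("CMCI storm detected", "CMCI storm subsided", and "EDAC MC0: 1900 CE error on ..."). Other kernel versions and EDAC drivers word these differently, so treat the patterns as an assumption. The helper name cmci_summary is made up.

```shell
# cmci_summary
# Reads kernel log lines from stdin and prints how many storms started,
# how many subsided, and the sum of the corrected-error counts found in
# lines of the form "... EDAC MC0: <count> CE error on ...".
cmci_summary() {
  awk '
    /CMCI storm detected/ { started++ }
    /CMCI storm subsided/ { ended++ }
    / CE error on / {
      for (i = 2; i < NF; i++)
        if ($i == "CE" && $(i + 1) == "error") total += $(i - 1)
    }
    END { printf "storms=%d subsided=%d corrected=%d\n", started, ended, total }
  '
}

# Typical use:
#   dmesg | cmci_summary
```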

lm-sensors notes

tools for quick statistics

overview tools


top is a basic overview of CPU, memory, swap, and process statistics. It's a little verbose, and not everything is very important.

Example output:

top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

11192 root      30   5 59136  55m R 43.6  5.5   0:03.30 cc1plus 
11199 root      26   5 12896 9244 R 24.8  0.9   0:00.25 cc1plus 
11193 root      22   5  4972 3332 S  1.0  0.3   0:00.02 i686-pc-linux-gnu-g++
11197 root      15   0  2100 1140 R  1.0  0.1   0:00.07 top
11198 root      24   5  2144  884 S  1.0  0.1   0:00.01 i686-pc-linux-gnu-g++
    1 root      15   0  1496  432 S  0.0  0.0   0:07.40 init [3]
    2 root      34  19     0    0 R  0.0  0.0   0:06.24 [ksoftirqd/0]
...and so on


When all you wanted is a list of processes, ps is a simpler, non-interactive tool

ps is convenient for some quick checks like:

ps faux | grep smbd    # is service running?
ps faux | grep ssh     # show connected sessions

and also for scripting, as you can control the output format. For example, consider:

# script to renice all of a user's processes. You'll often need root rights.
TARGET_USER=${1:?Missing username}
NICENESS=${2:-10}   # defaults to 10; :- is the shell syntax for default-if-missing
renice -n "$NICENESS" -p $(ps --no-headers -U "$TARGET_USER" -o pid)

# continuously list processes in uninterruptible sleep (mostly: see which processes hit the disk harder)
# (note: D+ means foreground)
while true; do
  ps -eo stat,user,comm,pid | grep '^D'
  sleep 0.1
done

# script to summarize users that use nontrivial CPU or memory
ps --no-headers -eo user,%cpu,%mem | \
  awk '{usercpu[$1]+=$2; usermem[$1]+=$3} 
       END { for (u in usercpu) { if (usercpu[u]>5 || usermem[u]>5) 
             printf("%15s using  %4d%% CPU  and %4d%% resident memory\n", 
                        u, usercpu[u], usermem[u]) }  }'
      postgres using     0% CPU  and    9% resident memory
      www-data using     5% CPU  and   30% resident memory
          root using     9% CPU  and    3% resident memory

specific tools


vmstat will give you an overall view of what the kernel is up to, including memory usage, block IO, swap activity, context switches, and more. For example, for averages over 3-second intervals:

vmstat 3

There are some other reports it can give, see its man page.

A related, slightly nicer looking third party app you may want to look at is dstat.


iostat mostly gives IO statistics for devices and partitions.

It can give total read and written, and/or give summaries over short intervals

By default it gives just read and write speeds, but it can also give you the size of requests, the wait times, and more.

nonstandard tools

As in, ones you'll probably have to install.


Compared to plain top:

  • more compact ascii-arty summary of some things
  • searchable
  • slightly handier, e.g. you can select multiple processes and kill all in selection
  • slightly more useful columns
  • more configurable (if you like that kinda thing)




In addition to CPU and memory and swap (like top)

  • disk IO
  • network IO
  • disk mounts/use

...and colors some key ones on whether they seem to be unduly loaded.

glances tool


Since shared memory just counts towards each participating process's mapped memory (and RSS and more), SHM use causes any tool that does "add up RSS to get total memory use" to over-report.

By a lot whenever things use significant SHM.

So smem goes through all the memory maps, and ends up with a new metric, PSS (proportional set size), where the total does add up correctly, by reporting SHM spread among all its users.

Which is fake, but adds up correctly, and is a more meaningful figure for a quick glance.

You have probably noticed smem is slow. This mostly isn't the smem tool itself, it's the kernel generating the underlying /proc/PID/smaps report - doing so for a few hundred processes takes a second or two or five.
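
To see what smem is adding up, a sketch that sums the Pss lines for a single process. It assumes the /proc/PID/smaps format, where each mapping has a line like "Pss: 50 kB". The function name pss_kib is made up for this example.

```shell
# pss_kib FILE
# Sums the "Pss:" lines of a file in /proc/PID/smaps format and prints
# the total in KiB. Lines such as "Pss_Anon:" are deliberately not matched.
pss_kib() {
  awk '/^Pss:/ { sum += $2 } END { print sum + 0 }' "$1"
}

# Typical use (reading other users' processes usually needs root):
#   pss_kib /proc/$$/smaps
```

Comparing this against the same sum over the Rss: lines shows how much of a process's resident size is shared with others.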



See also