Linux admin notes - health and statistics

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


Reading (linux) system use and health

Load average

reported by top, uptime, and more (values typically from /proc/loadavg)


tl;dr:

  • decent indication of sustained load on a node
  • not useful for changes on the timescale of seconds
  • not technically averages, more a lowpass filter on load.


"Load" is an estimate of how many processes are actively doing stuff, specifically:

  • using CPU, or
  • waiting to be scheduled on a CPU, or
  • in uninterruptible sleep (often disk, sometimes network or other)

"average" - not actually. It's an exponentially damped figure - think lowpass filter. This matters because the 1, 5, 15 figures are not "the average over that many minutes"; the numbers say how quickly each figure adapts to the real load. Which is useful, just not an average.


When you see something like
load average: 1.69, 1.70, 1.73
you can guess that there are probably two sustained processes actively using the CPU (and likely sharing its speed), but since it's under 2.0, one or both are probably not active all the time.
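
As a rough sketch of that damping: the function below (simulate_load is a made-up name) applies the exponential update once per 5-second tick, assuming the 5-second sampling interval Linux uses and a constant number of active processes. It only illustrates the shape of the filter, it does not read anything from the system.

```shell
# Simulate a kernel-style damped load figure.
# Usage: simulate_load MINUTES SAMPLES N
#   MINUTES: which figure (1, 5 or 15)
#   SAMPLES: number of 5-second ticks to simulate
#   N:       constant number of active processes during those ticks
simulate_load() {
  awk -v minutes="$1" -v samples="$2" -v n="$3" 'BEGIN {
    decay = exp(-5 / (60 * minutes))   # per-tick damping factor
    load = 0                           # start from an idle system
    for (i = 0; i < samples; i++)
      load = load * decay + n * (1 - decay)
    printf "%.2f\n", load
  }'
}
```

For example, simulate_load 1 12 1 prints 0.63: after a full minute of one busy process, the 1-minute figure has only reached about 63% of the real load.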


If the number is high, you can usually assume there are many things fighting for CPU or disk. Keep in mind that if you have a many-core processor, more processes can run alongside each other perfectly fine

...though you can't tell whether they're happily working alongside, or throttling your disk system.
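
To relate the figures to the number of cores, something like the following can help. It is a minimal sketch: load_per_core is a hypothetical helper that expects the contents of /proc/loadavg on stdin and the core count as its argument.

```shell
# Divide the three load figures by the number of cores.
# Usage: load_per_core NCORES < /proc/loadavg
# e.g.   load_per_core "$(nproc)" < /proc/loadavg
load_per_core() {
  awk -v cores="$1" '{
    printf "%.2f %.2f %.2f\n", $1 / cores, $2 / cores, $3 / cores
  }'
}
```

Values well under 1.0 per core suggest spare capacity, though this still cannot tell CPU contention apart from disk contention.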


When swapping and particularly when thrashing, the load average may spike simply because many things are waiting, while the kernel spends a lot of IO time swapping things in and out.

Process states, CPU use types

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

At low level, there are relatively few states[1], but in practice you also care about what the scheduler thinks.

The combination reports roughly the state letters you'll see in ps and top.

  • R - running
  • S - sleeping (interruptible)
  • I - noload idle (a distinction from S introduced later) (verify)
  • D - disk sleep (not interruptible)
  • T - stopped (paused by job control)
  • t - tracing stop (paused for tracing reasons)
  • X - dead
  • Z - zombie (user code has exited, process is sitting around waiting for kernel to remove the process)
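
A quick way to see how many processes are in each of these states is to tally the first letter of ps's stat column. A sketch, with count_states being a made-up helper name:

```shell
# Count processes per state letter (the first character of the stat column).
# Usage: ps -eo stat= | count_states
count_states() {
  awk '{ count[substr($1, 1, 1)]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

Extra characters after the first letter (such as + or s) are modifiers and are ignored here.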

The scheduler has changed over time, which is why some of these are newer, like

  • I - no-load idle (versus S, often 'waiting for event') (verify) (previously N?(verify))
  • X - dead (though apparently you shouldn't ever see this)

And some were introduced, then removed again:

  • x - dead (what difference?)
  • K - wakekill, "wake on signals that are deadly"
  • W - waking (not reported anymore?, internal intermediate you would never really see(verify))
    (did it mean paging in a process?)
  • P - parked (not reported anymore?, internal intermediate you would never really see(verify))


The difference between interruptible and uninterruptible sleep seems to be an implication of the semantics / documented guarantees of IO syscalls (and couldn't be changed without breaking a lot of software).

  • uninterruptible sleep
used when, for correctness, certain signals should not be handled
in practice typically means "paused because of blocking system call, and specifically device IO because most other syscalls can and will be fulfilled immediately"
  • interruptible sleep - not doing anything, or waiting for an event before becoming runnable again
can be voluntary (e.g. sleep()), or forced by the kernel
still allows signal handlers
note there is a distinction between 'ready and in memory' and 'ready, but in swap'


The scheduler has some further distinctions you generally wouldn't care about

  • runnable, a.k.a. ready - waiting to be scheduled on the CPU, but not currently on there
normal part of scheduling
note there is a distinction between:
'ready and in memory' (just a context switch away)
'ready, but in swap' (need swap in first)
  • running, in user code
  • running, in kernel code (because of syscall or interrupt)
  • preempted - for scheduling reasons(verify)
  • running in interrupt


Things like top report 'time spent' with their own split:
  • Nice - time spent in user code in processes with niceness >=1
  • User - time spent in user code in processes with niceness <=0
  • System - time spent in kernel code, often "...fulfilling nontrivial syscalls"
including some non-blocking IO calls (the parts buffered by the system and fulfilled later)
  • Wait - time spent waiting in D (uninterruptible sleep) state, typically because of blocking IO syscalls
which usually means the process is not being scheduled at least some of the time
  • hi, si: time spent handling hardware and software interrupts
  • st: steal time, spent waiting on a hypervisor to schedule us (verify)
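
The same split can be read from the first line of /proc/stat, which holds cumulative tick counters in the order user, nice, system, idle, iowait, irq, softirq, steal. A sketch (cpu_split is a made-up name); note that it reports percentages since boot, whereas top shows the change between two refreshes:

```shell
# Turn the aggregate "cpu" line of /proc/stat into percentages since boot.
# Usage: head -n 1 /proc/stat | cpu_split
cpu_split() {
  awk '$1 == "cpu" {
    total = 0
    for (i = 2; i <= 9; i++) total += $i      # user .. steal
    if (total == 0) exit
    printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
      100 * $2 / total, 100 * $3 / total, 100 * $4 / total, 100 * $5 / total,
      100 * $6 / total, 100 * $7 / total, 100 * $8 / total, 100 * $9 / total
  }'
}
```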


Wait time

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

tl;dr

  • it's probably doing a bunch of reading/writing to disk:
(more likely to last longer (and be noticeable) on platter disks, due to their higher latency)
  • occasional D state often just means some heavy (disk) work
  • a lot of D state often means overloaded disk system


The state of uninterruptible sleep is typically IO wait time, and shown as 'D' in top, ps and the like.


Roughly speaking:

For context, S (interruptible sleep) is used for 'pause until resource becomes available before we do a thing' (or 'I have nothing to do') yet will still become active to handle incoming signals.


D (uninterruptible sleep) differs in that it won't handle incoming signals, and is used for processes in syscalls where interrupting them may be a bad thing to do, so the kernel decides it won't schedule it before such a syscall is done.

While uninterruptible sleep means 'any syscall blocking in an uninterruptible way', in practice this is avoidable for almost everything other than IO, and even then not all the time.


In practice it is mostly seen with disk, and then mostly platter disks, for the practical reason that disk is orders of magnitude slower than memory, making it the typical thing that gets waited on. Multiple processes fighting for a platter disk amplify that effect by increasing the amount of seeking involved.

Note that various IO calls only sometimes block. For example, disk writes and network sends are often buffered, so for small amounts the syscall just accepts the data in the buffer and returns (to later do its own thing), and only when that buffer is considered full would that syscall block.

...note that for sending large blobs of data, that's what you'd expect, and even want as a sort of flow control.


There are some further cases, like

  • mmapped IO
  • network mounts if their file interface needs it.
  • accessing memory that was swapped out
  • swapping to the point of thrashing (will apply to many active processes)
  • a disk with a bad sector trying to correct it before it responds (if paranoid, look at SMART reports)


Digging deeper

To figure out which disk(s), try:

iostat -x 2

(if you don't have iostat, it's usually in a package called sysstat) Without understanding all those columns, it's usually easy enough to read off which disk(s) is going crazy.

Some notes:

  • the reads and writes/second are basically the IOPS figure, which you can compare against expectation of your disk (or array). It's one decent metric of disk utilisation.
  • some things cause IOwait without involving a lot of data - e.g. a lot of seeking caused by fstat()ing a large directory tree.
This is possibly more visible in the await column, which shows the average time (milliseconds) that disk operations stay queued (waiting, seeking, reading/writing and everything around it)
particularly on platter: if it's ≤the drive's seek time then the drive probably does one thing at a time (=is keeping up with the requests). If it's usually more than the drive's seek time, then things are probably regularly waiting on seeks.
Platter drives are often around ~7ms, so a dozen's still fine, while a hundred's starting to be a lot.
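
If iostat isn't around, a cruder version of the await idea can be computed from /proc/diskstats. This is a sketch under assumptions: disk_avg_wait is a made-up name, it relies on the documented field order of that file (reads completed, ms spent reading, writes completed, ms spent writing), and it gives an average since boot rather than per interval as iostat does.

```shell
# Average milliseconds per completed IO, per device, since boot.
# Usage: disk_avg_wait < /proc/diskstats
disk_avg_wait() {
  awk '{
    ios = $4 + $8            # reads completed + writes completed
    ms  = $7 + $11           # ms spent reading + ms spent writing
    if (ios > 0) printf "%s %.2f\n", $3, ms / ios
  }'
}
```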


Past observing that you have IO wait time, you may want to find out what process and what device is so busy.

A start is to do something like:

while true; do sleep .3; ps -lyfe | grep -E '^D'; done

This still leaves some guesswork, though. Your cause will be one of the processes, and the rest are being held up by it. There is not really a fundamental difference.
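
One way to reduce the guesswork is to sample for a while and count how often each command shows up in D state. A sketch, where tally_dstate is a made-up helper that takes "state command" lines on stdin:

```shell
# Count how often each command was seen in D state across many ps samples.
# Usage (20 samples, half a second apart):
#   for i in $(seq 20); do ps -eo stat=,comm=; sleep 0.5; done | tally_dstate
# Only the first word of a command name is used.
tally_dstate() {
  awk '$1 ~ /^D/ { seen[$2]++ }
       END { for (c in seen) print seen[c], c }' | sort -rn
}
```

The commands at the top of the output are the ones most often blocked, which is a hint rather than proof of which one is the cause.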


Also you'll almost always see ignorable things like

  • pdflush (kernel process which buffers and flushes data to disk; newer kernels use per-device flusher threads instead) - in fact, until there's contention, this will be in D more than the process where the data comes from. (Offloading iowait is sort of the point of pdflush.)
  • kworker (kernel process that handles interrupts, timers, IO)
  • filesystem-supporting processes like jbd2 (for ext[34]), txg_sync for zfs, and so on


You can try to figure out the kind of work a process is doing so hard, e.g. by using strace (probably -c for time summary) to figure out which syscalls it's spending most time in.

There are helper scripts for this, like https://github.com/scarfboy/various/blob/master/straceD

You may also find perf, e.g. perf top, useful.


Other system / kernel details

vmstat will give you an overall view of what the kernel is up to, including memory usage, disk blocks use, swap activity, context switches, and more. For example, to report averages over each 3-second interval:

vmstat 3

There are some other reports it can give, see its man page.
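
When watching vmstat for trouble, the columns most relevant to the notes above are b (processes in uninterruptible sleep) and si/so (swap in/out). The sketch below only prints the intervals where those are non-zero. vmstat_flags is a made-up name, and it looks columns up by the header row rather than by fixed position.

```shell
# Print only the vmstat intervals that show swapping or blocked processes.
# Usage: vmstat 3 | vmstat_flags
vmstat_flags() {
  awk '
    # header row: remember which position each column name has
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # skip the banner line and anything that is not a row of numbers
    !("si" in col) || $1 !~ /^[0-9]+$/ { next }
    {
      if ($(col["si"]) + $(col["so"]) > 0)
        print "swapping: si=" $(col["si"]) " so=" $(col["so"])
      if ($(col["b"]) > 0)
        print "blocked processes: " $(col["b"])
    }'
}
```

Keep in mind that the first row vmstat prints is an average since boot, so it says little about the current moment.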

A related, slightly nicer looking third party app you may want to look at is dstat.


IO and filesystem usage

(keep in mind you may need to run these as superuser to be particularly informative)

Specific files

fuser lists PIDs that have a particular resource open (reading, writing, executing, mmapped; these and more are appended as a letter), e.g.

  • "figure out which filesystem this path is on, then list all processes which have files open on that filesystem". Useful to see what prevents a umount
fuser -m -v /mnt/data4
You can ask fuser to kill the implied processes.
e.g.
fuser -mik /mnt/data4
the -i, for interactive, is there to avoid killing way too much by accidentally implying the root filesystem.
...or you could use kill/killall manually.

Note that if you get:

                  USER        PID ACCESS COMMAND
/path:            root     kernel mount /path

...this probably means you've exported it via NFS. In many cases, stopping the kernel service is what you want here.


  • "check for a whole bunch of specific files", e.g. stuff that processes have open in /tmp
find /tmp -print0 | xargs -0 fuser -v    # (-print0 / -0 copes with odd filenames) or lazier:
fuser -v /tmp/* /tmp/*/*


  • "What has this directory open", e.g. your homedir
fuser -v ~


  • "who has TCP port 80, and in a human-readable form, please" (keep in mind you probably get no results without a sudo)
fuser -v -n tcp 80



lsof lists open files. Because of unix's "everything is files" philosophy, this includes sockets, directories, FIFOs, memory-mapped files, and more. It can be used to inspect program (mis)behaviour and such:

  • lsof /data4
    rather similar to fuser -m
  • lsof -u samba
    lists open files for user samba (something fuser cannot do)
  • lsof -c bash
    lists everything related to running bash processes
  • lsof -p 18817
    lists all things opened by a certain process
  • lsof -i
    "Alright, what's networking up to?" Netstat is probably more interesting for this, but looking by port (lsof -i :22) and host (lsof -i@192.168.5.5) is easy enough (see the man page for more details).
  • watch -n 0.1 "lsof -n -- /data /data2 | grep smbd | egrep -i '\b(DIR|REG)\b'"
    "keep tracking the files and directories that samba keeps open under the /data and /data2 directories"
  • or just a summary of which programs use how many handles:
    lsof | cut -f 1 -d ' ' | sort | uniq -c | sort -n

(Note that different *nix-style systems have different options on lsof)


IO summaries

vmstat gives a summary about processes, memory, swapping, block IO, interrupts, context switches, CPU and more. Good to inspect how a taxed system is being taxed.

For example,
vmstat 2
: show averages every two seconds

It can also show certain fine grained statistics, given kernel support (see the man page).



Things like iotop, and features within atop and htop can be used to show IO speed of processes, and/or totals.

On iotop:

  • needs to be run as root (or have NET_ADMIN capability), which can be impractical
  • iotop -o
    shows just processes with non-zero IO
  • iotop -a
    show cumulative amounts
  • OSError: Netlink error: Invalid argument (22) basically means your kernel doesn't have the support(verify). If you're on centos or rhel, this means 5.6 or later.(verify)

Networking

  • ifconfig to see (or configure) the network interfaces
  • netstat will list various things about networking and can show e.g.
    • open connections (no parameter)
    • listens and open connections (-a)
    • udp and/or tcp (-u, -t) since you often don't care about all the unix sockets
    • routing table (-r) (see also route)
    • interface summary (-i)
    • statistics (-s)

I use -pnaut (programname, noresolve, listen+connections, udp, tcp).

  • ss is similar to netstat


  • arp (arp -n to avoid resolves) to see the ARP table
  • route (route -n to avoid resolves) to see the routing table
  • iptables to change the IP filtering/nat/mangling tables (see also iptables). Possibly interesting to you are:
    • iptables-save, which produces file-saveable text (and is also handy to see all of the iptables state), and
    • iptables-restore, which reinstates a file saved through iptables-save.


  • iwconfig to see (or configure) the wireless network interfaces
    • (Other general wireless tools: iwevent, iwspy, iwlist, iwpriv)
    • (Other specific wireless tools: wlanconfig, etc.)

Kernel, drivers

  • lsmod lists currently loaded kernel modules (see also modprobe, insmod, rmmod)
  • lspci lists PCI devices. Using -v is a little more informative. (see also setpci)
  • lsusb lists USB busses and devices on them

Drives and space

df
tells you what storage you can get at, and how much space is left on each.


In contrast:

  • /etc/mtab lists things that are mounted, more completely than df does, because df reports only things meant for storage, which excludes things like proc, udev/devfs, usbfs, and whatnot.
  • To see an exhaustive list of things that the system knows could be mounted, see /etc/fstab (see also fstab).
  • To see swap partition/file use,
    cat /proc/swaps
    will do, which is basically what swapon -s does.
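
To turn /proc/swaps into a quick "how full is swap" figure, a sketch like the following works on the file's Size and Used columns (swap_used_pct is a made-up name):

```shell
# Percentage of each swap area in use.
# Usage: swap_used_pct < /proc/swaps
swap_used_pct() {
  # the first line is the column header; columns are Filename Type Size Used Priority
  awk 'NR > 1 && $3 > 0 { printf "%s %.1f%%\n", $1, 100 * $4 / $3 }'
}
```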


df notes:

  • The -h option is useful to see human-readable sizes.
  • df -B MiB (or MB) makes df report everything in megabytes, which can be useful when you're watching for differences on the order of megabytes per second (e.g. watch -d df -B MiB)
  • On ext2, ext3, and ext4, the figures do not add up exactly, because by default 5% of the space is reserved (short story: this is a good thing for general use, though in WORM situations it can make sense to set it to 0%).
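
To see how much space that reservation amounts to on a given ext filesystem, the two relevant lines of tune2fs -l output can be multiplied out. A sketch (reserved_mib is a made-up name, and the device below is a placeholder):

```shell
# Reserved space in MiB, computed from `tune2fs -l` output on stdin.
# Usage: tune2fs -l /dev/sdXN | reserved_mib     (device name is a placeholder)
reserved_mib() {
  awk -F: '
    /^Reserved block count/ { reserved = $2 }
    /^Block size/           { bsize = $2 }
    END { if (bsize > 0) printf "%d MiB reserved\n", reserved * bsize / (1024 * 1024) }'
}
```

The percentage itself can be changed with tune2fs -m, which is how you would set it to 0 for the WORM case.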


To see where the big stuff is in the directory tree, use du, detailed elsewhere. (There are also fancier graphical programs for this, such as baobab, that give better visual overview)

RAM health

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

EDAC

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

EDAC ("Error Detection and Correction") reports (among other things) errors in ECC RAM, and PCI bus transfers.


If you see errors in your logs, the main thing to check is whether they mention CE (Corrected Errors) or UE (Uncorrectable Errors)

An occasional CE is to be expected (that's not bad ECC, that's just you noticing the occasional bit flips that regular-RAM people don't notice).

However...
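
Besides the logs, the running CE and UE totals per memory controller can be read from sysfs when an EDAC driver is loaded. A sketch, assuming the usual /sys/devices/system/edac/mc location (edac_counts is a made-up name; the optional argument exists only so another directory can be pointed at):

```shell
# Print corrected/uncorrected error counts per memory controller.
# Usage: edac_counts [BASEDIR]
edac_counts() {
  base="${1:-/sys/devices/system/edac/mc}"
  for mc in "$base"/mc*; do
    # skip when no controllers exist or the counters are unreadable
    [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
    echo "$(basename "$mc") CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
  done
}
```

No output means no EDAC memory controllers were found there, which can simply mean the driver isn't loaded.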


See also:

http://www.admin-magazine.com/Articles/Monitoring-Memory-Errors

http://askubuntu.com/questions/334328/typed-apt-get-update-now-showing-a-long-list-of-edac-errors-is-there-anything

https://groups.google.com/forum/#!topic/fa.linux.kernel/3RV3y4Y2WT8%5B1-25-false%5D


CMCI storms

For example, you saw your log mention:

Sep  3 07:31:36 hostname kernel: [575877.525766] CMCI storm detected: switching to poll mode
Sep  3 07:31:37 hostname kernel: [575878.258800] EDAC MC0: 1900 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:38 hostname kernel: [575879.259555] EDAC MC0: 1610 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:39 hostname kernel: [575880.260323] EDAC MC0: 2560 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
...
Sep  3 07:32:06 hostname kernel: [575907.545221] CMCI storm subsided: switching to interrupt mode


CMCI itself refers to the way EDAC messages are communicated ((Intel's) Corrected Machine Check Interrupt).

CMCI messages are usually communicated using interrupts, to deliver them as soon as possible, which is nice for diagnosis because it logs them sooner rather than later.


However, when there are constant reports, this leads to an interrupt storm - a consistently high rate of interrupts that ends up taking away unnecessarily much of a CPU's time.

To avoid that slowdown, the CMCI kernel code switches to polling until the report rate is low again. (see log fragment above)


The CMCI storm itself isn't really the problem.

However, the cause of that many machine check messages typically is a problem, as it is often a sign of problematic/broken memory stick, or sometimes bad BIOS-level config of RAM.
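
To find out which stick the messages point at, the CE counts can be summed per DIMM label. This sketch assumes log lines shaped like the fragment above ("<count> CE error on <label>"); other kernel versions and EDAC drivers word this differently, and ce_per_dimm is a made-up name.

```shell
# Sum corrected-error counts per DIMM label from kernel log lines on stdin.
# Usage: dmesg | ce_per_dimm      (or feed it a syslog file)
ce_per_dimm() {
  awk '{
    for (i = 1; i + 4 <= NF; i++)
      if ($(i + 1) == "CE" && $(i + 2) == "error" && $(i + 3) == "on")
        total[$(i + 4)] += $i      # $i is the count, $(i+4) the DIMM label
  }
  END { for (d in total) print d, total[d] }' | sort
}
```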


If the last thing in your logs before a freeze/panic is a CMCI storm message, that's probably due to uncorrectable memory errors, which the OS logs may not mention because the kernel chose to panic before they could be written.

If your BIOS keeps recent EDAC messages, that will be a more reliable source.

If you suspect memory issues, consider running memtest86 or similar.

And/or: on the next bootup, watch a terminal doing
dmesg -w
and try to reproduce, maybe by running a user-space memtester (not quite as thorough). If a CMCI storm message is the last thing you see in the logs before a freeze or panic, your RAM is suspect.

tools for quick statistics

standard tools

Some of these utilities are fairly standard to most unices, some of them report information specifically from recent linux kernels, and some OSes have better utilities than these

top

top is a basic overview of CPU, memory, swap, and process statistics. It's a little verbose, and not everything is very important.

Example output:

top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

  PID USER      PR  NI  VIRT  RES S %CPU %MEM    TIME+  COMMAND
11192 root      30   5 59136  55m R 43.6  5.5   0:03.30 cc1plus 
11199 root      26   5 12896 9244 R 24.8  0.9   0:00.25 cc1plus 
11193 root      22   5  4972 3332 S  1.0  0.3   0:00.02 i686-pc-linux-gnu-g++
11197 root      15   0  2100 1140 R  1.0  0.1   0:00.07 top
11198 root      24   5  2144  884 S  1.0  0.1   0:00.01 i686-pc-linux-gnu-g++
    1 root      15   0  1496  432 S  0.0  0.0   0:07.40 init [3]
    2 root      34  19     0    0 R  0.0  0.0   0:06.24 [ksoftirqd/0]
...and so on




ps

ps can be considered a non-interactive equivalent to top.


ps is convenient for some quick checks like:

ps faux | grep smbd    # is service running?
ps faux | grep ssh     # show connected sessions


and also for scripting, as you can control the output format. For example, consider:

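#!/bin/bash
# Renice all of a user's processes. You'll often need root rights.
# (lowercase variable names, to avoid overwriting the shell's own $USER)
target_user=${1:?Missing username}
niceness=${2:-10}   # defaults to 10; ${var:-value} is the bash syntax for default-if-missing
# $(...) is deliberately unquoted so that each PID becomes its own argument
renice -n "$niceness" -p $(ps --no-headers -U "$target_user" -o pid)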


#!/bin/bash
# continuously list processes in uninterruptible sleep (see what hits the disk)
# (note: D+ means foreground)
while true
do
  ps -eo stat,user,comm,pid | grep -E '^D'
  sleep 0.1
done


#!/bin/bash
# script to summarize users that use CPU and memory
ps --no-headers -eo user,%cpu,%mem | \
  awk '{usercpu[$1]+=$2; usermem[$1]+=$3} 
       END { for (u in usercpu) { if (usercpu[u]>5 || usermem[u]>5) 
             printf("%15s using  %4d%% CPU  and %4d%% resident memory\n", 
                        u, usercpu[u], usermem[u]) }  }'

Example output:

      postgres using     0% CPU  and    9% resident memory
      www-data using     5% CPU  and   30% resident memory
          root using     9% CPU  and    3% resident memory


Selecting and stopping: pidof, kill, killall

When a program offers no GUI or shell way of quitting it, you'll have to use slightly harsher means. The old-fashioned way is to get its process id with pidof and then use kill or, failing that, kill -9.

The difference is that plain kill defaults to the TERM signal (15), which the process receives and can handle, so it can choose to shut down nicely. (Signals are used for more than termination; in particular HUP (hangup, 1) has been used as a 'reload configuration files' signal.) Signal 9, KILL, is untrappable and instructs the kernel to kill the process somewhat harshly. It is the surest way to kill, but means no cleanup in terms of child processes, IO, and such ((verify) which), so should only be used if the default TERM didn't work.
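
The "TERM first, KILL only if that didn't work" routine can be wrapped up as below. This is a sketch, not a standard tool: term_then_kill is a made-up name and the 10-second default is an arbitrary choice.

```shell
# Send TERM, wait up to TIMEOUT seconds, then fall back to KILL.
# Usage: term_then_kill PID [TIMEOUT_SECONDS]
# Returns 0 if the process went away after TERM, 1 if KILL was sent,
# 2 if TERM could not be sent at all (no such process, or no permission).
# Caveat: kill -0 still succeeds for an exited but unreaped (zombie) process.
term_then_kill() {
  pid="$1"
  timeout="${2:-10}"
  kill -TERM "$pid" 2>/dev/null || return 2
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # no longer exists: TERM was enough
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -KILL "$pid" 2>/dev/null
  return 1
}
```

Usage would be along the lines of term_then_kill 8222 5 for a single PID.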


Kill takes a process id, which you can get from top, ps and others, or more directly via pidof. Using killall enables you to use a process name instead of a PID. Summarizing:

# pidof firefox-bin
8222 8209 8208 8204
# kill 8222 8209 8208 8204
# kill `pidof firefox-bin`
# killall firefox-bin

You do have to match the actual name. For example, if firefox is a script that runs an actual executable called firefox-bin, then killall firefox doesn't always do you much good.


As a regular user you can only kill your own processes; root can kill anything (except some kernel processes).


Processes will not die while they are in IOwait. This usually doesn't matter, unless it is blocked on a single call for a very long time. You'll want to take away the thing they are blocked on before they'll die. This may not be a simple task.

nonstandard tools

As in, ones you'll have to install, but are probably useful.

htop

Compared to plain top:

  • a more compact, ascii-arty summary of some things
  • searchable
  • slightly handier, e.g. you can select multiple processes and kill all in selection
  • slightly more useful columns
  • more configurable (if you like that kinda thing)



atop

glances

Sort of a variant of top that also summarizes

  • disk IO
  • network IO
  • disk mounts/use

...and colors some key ones on whether they seem to be unduly loaded.



Tools

See also

http://tldp.org/LDP/sag/html/index.html