Linux admin notes - health and statistics

From Helpful
Revision as of 19:57, 20 September 2021 by Helpful (Talk | contribs)

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Reading (linux) system use and health

Load average

reported by top, uptime, and more (values typically from /proc/loadavg)


  • decent indication of sustained load on a node
  • not useful for changes on the term of seconds
  • not technically averages, more a lowpass filter on load.

"Load" is an estimate of how many processes are actively doing stuff, specifically:

  • using CPU, or
  • waiting to be scheduled on a CPU, or
  • in uninterruptible sleep (often disk, sometimes network or other)

"average" - not actually. It's an exponentially damped thing - think lowpass. This is relevant in that the 1, 5, 15 figures are not at all "average over that many minutes", but say how fast each figure adapts to the real load. Which is useful, just not an average.

When you see something like
load average: 1.69, 1.70, 1.73
you can guess that there are probably two sustained processes actively using the CPU (and likely sharing its speed), but since it's under 2.0, one or both are probably not active all the time.
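
To make the "lowpass, not average" point concrete, here is a minimal sketch of the damping. It assumes the textbook form (one sample every 5 seconds, a decay factor of exp(-5 / (60 × minutes))); the kernel does something similar in fixed-point arithmetic. The name damped_load is made up for this illustration.

```shell
# damped_load SAMPLES ACTIVE [MINUTES]
# Start at load 0, feed SAMPLES five-second samples that each see ACTIVE
# busy processes, and print the resulting 1/5/15-minute style figure.
damped_load() {
  awk -v n="$1" -v active="$2" -v minutes="${3:-1}" 'BEGIN {
    decay = exp(-5 / (60 * minutes))      # per-sample damping factor
    load = 0
    for (i = 0; i < n; i++)
      load = load * decay + active * (1 - decay)
    printf "%.2f\n", load
  }'
}

damped_load 12 1 1     # one busy process for one minute, 1-minute figure
damped_load 12 1 15    # the same minute, as seen by the 15-minute figure
```

After a full minute of one busy process, the 1-minute figure has only reached about 0.63, and the 15-minute figure about 0.06.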

If the number is high, you can usually assume there are many things fighting for CPU or disk. Keep in mind that if you have a many-core processor, more processes can run alongside each other perfectly fine

...though you can't tell whether they're happily working alongside, or throttling your disk system.
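
A rough way to put the load figure in context is to divide it by the number of cores. The sketch below does just that; load_per_core is a hypothetical helper name, and the result still cannot tell CPU contention from disk wait.

```shell
# load_per_core LOAD CORES : print LOAD divided by CORES, two decimals
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live Linux system (nproc comes with GNU coreutils):
#   load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

Values that stay well above 1 per core suggest things are queueing up; values below it say little either way.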

When swapping, and particularly when thrashing, the load figure may spike simply because many things are waiting while the kernel spends a lot of IO time swapping things in and out.

Process states, CPU use types

Wait time

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


  • it's probably doing a bunch of reading/writing to disk
    (more likely to last longer (and be noticeable) on platter disks, due to their higher latency)
  • occasional D state often just means some heavy (disk) work
  • a lot of D state often means overloaded disk system

The state of uninterruptible sleep is typically IO wait time, and shown as 'D' in top, ps and the like.

Roughly speaking:

For context, S (interruptible sleep) is used for 'pause until resource becomes available before we do a thing' (or 'I have nothing to do') yet will still become active to handle incoming signals.

D (uninterruptible sleep) differs in that it won't handle incoming signals, and is used for processes in syscalls where interrupting them may be a bad idea, so the kernel decides it won't schedule the process before such a syscall is done.

While uninterruptible sleep means 'any syscall blocking in an uninterruptible way', in practice this is avoidable for almost everything other than IO, and even then not all the time.

In practice it is mostly seen on disk, and then mostly on platter disks, for the practical reason that disk is orders of magnitude slower than memory, making it the typical thing that would be waited on. Multiple processes fighting for a platter disk amplify that effect, by increasing the amount of seeking involved.

Note that various IO calls only sometimes block. For example, disk writes and network sends are often buffered, so for small amounts the syscall just accepts the data in the buffer and returns (to later do its own thing), and only when that buffer is considered full would that syscall block.

...note that for sending large blobs of data, that's what you'd expect, and even want as a sort of flow control.

There are some further cases, like

  • mmapped IO
  • network mounts if their file interface needs it.
  • accessing memory that was swapped out
  • swapping to the point of thrashing (will apply to many active processes)
  • a disk with a bad sector trying to correct it before it responds (if paranoid, look at SMART reports)
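
To see at a glance how many processes are in D compared to the other states, something like the following sketch can help. It assumes procps-style ps; count_states is a made-up name.

```shell
# count_states : read one process state per line (as printed by
# `ps -eo stat=`) and count processes by first letter (R, S, D, Z, T, ...).
count_states() {
  awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a live system:
#   ps -eo stat= | count_states
```

Only the first letter is used, because ps appends modifiers such as + (foreground) or s (session leader) to the state.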

Digging deeper

To figure out which disk(s), try:

iostat -x 2

(If you don't have iostat, it's usually in a package called sysstat.) Even without understanding all those columns, it's usually easy enough to read off which disk(s) are going crazy.

Some notes:

  • the reads and writes/second are basically the IOPS figure, which you can compare against expectation of your disk (or array). It's one decent metric of disk utilisation.
  • some things cause IOwait without involving a lot of data - e.g. a lot of seeking caused by fstat()ing a large directory tree.
    • This is possibly more visible in the await column, which shows the average time (milliseconds) that disk operations take, including time spent queued (waiting, seeking, reading/writing and everything around it)
    • particularly on platter: if it's ≤ the drive's seek time then the drive is probably doing one thing at a time (= is keeping up with the requests). If it's usually more than the drive's seek time, then things are probably regularly waiting on seeks.
    • Platter drives are often around ~7ms, so a dozen ms is still fine, while a hundred is starting to be a lot.
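
The await check can be automated a little. The sketch below is a hypothetical filter (flag_slow_disks is not a real tool) that reads iostat -x style output and prints devices whose await-type columns exceed a limit in milliseconds. It finds columns by header name, because the exact column set (await, or r_await and w_await, and so on) differs between sysstat versions.

```shell
# flag_slow_disks [LIMIT_MS] : read `iostat -x` style text on stdin and
# print "device column value" wherever an *await column exceeds LIMIT_MS.
flag_slow_disks() {
  awk -v limit="${1:-100}" '
    /^avg-cpu/ { have = 0; next }          # CPU summary block: not a device table
    /^Device/  {                           # header line: remember the await columns
      split("", col)
      for (i = 1; i <= NF; i++) if ($i ~ /await$/) col[i] = $i
      have = 1; next
    }
    have && NF > 1 {
      for (i in col) if ($i + 0 > limit) print $1, col[i], $i
    }
  '
}

# On a live system:
#   iostat -x 2 | flag_slow_disks 100
```

Keep in mind that the first report iostat prints covers the time since boot, so only later reports reflect what is happening now.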

Past observing that you have IO wait time, you may want to find out what process and what device is so busy.

A start is to do something like:

while true; do sleep .3; ps -lyfe | grep '^D'; done

This still leaves some guesswork, though. Your cause will be one of the processes, and the rest are being held up by it. There is not really a fundamental difference.

Also you'll almost always see ignorable things like

  • pdflush (kernel process which buffers and flushes data to disk) - in fact, until there's contention, this will be in D more than the process where the data comes from. (Offloading iowait is sort of the point of pdflush.)
  • kworker (kernel process that handles interrupts, timers, IO)
  • filesystem-supporting processes like jbd2 (for ext[34]), txg_sync for zfs, and so on

You can try to figure out the kind of work a process is doing so hard, e.g. by using strace (probably -c for time summary) to figure out which syscalls it's spending most time in.

There are helper scripts for this, like

You may also find perf, e.g. perf top, useful.

Other system / kernel details

vmstat will give you an overall view of what the kernel is up to, including memory usage, disk block use, swap activity, context switches, and more. For example, to get a report over each 3-second interval:

vmstat 3

There are some other reports it can give, see its man page.

A related, slightly nicer looking third party app you may want to look at is dstat.

IO and filesystem usage

(keep in mind you may need to run these as superuser to be particularly informative)

Specific files

fuser lists PIDs that have a particular resource open (reading, writing, executing, mmapped; these and more are appended as a letter), e.g.

  • "figure out which filesystem this path is on, then list all processes which have files open on that filesystem". Useful to see what prevents a umount
fuser -m -v /mnt/data4
You can ask fuser to kill the implied processes.
fuser -mik /mnt/data4
the -i, for interactive, is used to avoid accidentally killing way too much by accidentally implying the root filesystem.
...or you could use kill/killall manually.

Note that if you get:

                  USER        PID ACCESS COMMAND
/path:            root     kernel mount /path

...this probably means you've exported it via NFS. In many cases, stopping the kernel service is what you want here.

  • "check for a whole bunch of specific files", e.g. stuff that processes have open in /tmp
find /tmp -print0 | xargs -0 fuser -v    # (-print0/-0 copes with odd filenames) or lazier:
fuser -v /tmp/* /tmp/*/*

  • "Which has this directory open", e.g. your homedir
fuser -v ~

  • "who has TCP port 80, and in a human-readable form, please" (keep in mind you probably get no results without a sudo)
fuser -v -n tcp 80

lsof lists open files. Because of unix's "everything is files" philosophy, this includes sockets, directories, FIFOs, memory-mapped files, and more. It can be used to inspect program (mis)behaviour and such:

  • lsof /data4
    rather similar to fuser -m
  • lsof -u samba
    lists open files for user samba (something fuser cannot do)
  • lsof -c bash
    lists everything related to running bash processes
  • lsof -p 18817
    lists all things opened by a certain process
  • lsof -i
    "Alright, what's networking up to?" Netstat is probably more interesting for this, but looking by port (lsof -i :22) and host (lsof -i@) is easy enough (see the man page for more details).
  • watch -n 0.1 "lsof -n -- /data /data2 | grep smbd | egrep -i '\b(DIR|REG)\b'"
    : "keep tracking the files and directories that samba keeps open under the /data and /data2 directories"
  • or just a summary of which programs use how many handles:
    lsof | cut -f 1 -d ' ' | sort | uniq -c | sort -n

(Note that different *nix-style systems have different options on lsof)

IO summaries

vmstat gives a summary about processes, memory, swapping, block IO, interrupts, context switches, CPU and more. Good to inspect how a taxed system is being taxed.

For example,
vmstat 2
: show averages every two seconds

It can also show certain fine grained statistics, given kernel support (see the man page).

Things like iotop, and features within atop and htop can be used to show IO speed of processes, and/or totals.

On iotop:

  • needs to be run as root (or have NET_ADMIN capability), which can be impractical
  • iotop -o
    shows just processes with non-zero IO
  • iotop -a
    show cumulative amounts
  • OSError: Netlink error: Invalid argument (22) basically means your kernel doesn't have the support(verify). If you're on centos or rhel, this means 5.6 or later.(verify)


  • ifconfig to see (or configure) the network interfaces
  • netstat will list various things about networking and can show e.g.
    • open connections (no parameter)
    • listens and open connections (-a)
    • udp and/or tcp (-u, -t) since you often don't care about all the unix sockets
    • routing table (-r) (see also route)
    • interface summary (-i)
    • statistics (-s)

I use -pnaut (programname, noresolve, listen+connections, udp, tcp).

  • ss is similar to netstat

  • arp (arp -n to avoid resolves) to see the ARP table
  • route (route -n to avoid resolves) to see the routing table
  • iptables to change the IP filtering/nat/mangling tables (see also iptables). Possibly interesting to you are:
    • iptables-save, which produces file-saveable text (and is also handy to see all of the iptables state), and
    • iptables-restore, which reinstates a file saved through iptables-save.

  • iwconfig to see (or configure) the wireless network interfaces
    • (Other general wireless tools: iwevent, iwspy, iwlist, iwpriv)
    • (Other specific wireless tools: wlanconfig, etc.)

Kernel, drivers

  • lsmod lists currently loaded kernel modules (see also modprobe, insmod, rmmod)
  • lspci lists PCI devices. Using -v is a little more informative. (see also setpci)
  • lsusb lists USB busses and devices on them

Drives and space

df tells you what storage you can get at, and how much space is left on each.

In contrast:

  • /etc/mtab lists things that are mounted, more completely than df does, because df reports only things meant for storage, which excludes things like proc, udev/devfs, usbfs, and whatnot.
  • To see an exhaustive list of things that the system knows could be mounted, see /etc/fstab (see also fstab).
  • To see swap partition/file use,
    cat /proc/swaps
    will do, which is basically what swapon -s does.
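
If you want a single number out of /proc/swaps, a small sketch like this sums the Size and Used columns over all swap areas. The helper name swap_used_pct is made up; the column order is the one /proc/swaps normally has (Filename, Type, Size, Used, Priority).

```shell
# swap_used_pct : read /proc/swaps-style text on stdin, print percent used
swap_used_pct() {
  awk 'NR > 1 { size += $3; used += $4 }       # skip the header line
       END { if (size > 0) printf "%.1f\n", 100 * used / size
             else print "no swap" }'
}

# On a live system:
#   swap_used_pct < /proc/swaps
```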

df notes:

  • The -h option is useful to see human-readable sizes.
  • df -B MiB (or MB) makes df report everything in megabytes, which can be useful when you're watching for differences on the order of megabytes per second (e.g. watch -d df -B MiB)
  • On ext2, ext3, and ext4, the figures do not add up exactly, because by default 5% of the space is reserved (short story: this is a good thing for general use, though in WORM situations it can make sense to set it to 0%).
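
For scripted checks, df's POSIX output format (-P) is easier to parse than the default one. A minimal sketch, assuming mount points without spaces; full_filesystems is a hypothetical name:

```shell
# full_filesystems [LIMIT] : read `df -P` output on stdin and print the mount
# point and use% of every filesystem at or above LIMIT percent (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5; sub(/%/, "", pct)            # "95%" -> 95
    if (pct + 0 >= limit) print $6, $5
  }'
}

# On a live system:
#   df -P | full_filesystems 90
```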

To see where the big stuff is in the directory tree, use du, detailed elsewhere. (There are also fancier graphical programs for this, such as baobab, that give better visual overview)

RAM health

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


EDAC ("Error Detection and Correction") reports (among other things) errors in ECC RAM, and PCI bus transfers.

If you see errors in your logs, the main thing to check is whether they mention CE (Corrected Errors) or UE (Uncorrectable Errors)

An occasional CE is to be expected (that's not bad ECC, that's just you noticing the occasional bit flips that regular-RAM people don't notice).

For example:

Sep  3 07:31:36 hostname kernel: [575877.525766] CMCI storm detected: switching to poll mode
Sep  3 07:31:37 hostname kernel: [575878.258800] EDAC MC0: 1900 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:38 hostname kernel: [575879.259555] EDAC MC0: 1610 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:39 hostname kernel: [575880.260323] EDAC MC0: 2560 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:32:06 hostname kernel: [575907.545221] CMCI storm subsided: switching to interrupt mode

See also:!topic/fa.linux.kernel/3RV3y4Y2WT8%5B1-25-false%5D

CMCI storms

CMCI refers to the way EDAC messages are communicated ((Intel's) Corrected Machine Check Interrupt).

These messages are usually delivered via interrupts, which gets them out almost immediately. That is nice for diagnosis, as they get logged sooner rather than later.

However, when there are constant reports, this leads to an interrupt storm (a consistently high rate of interrupts that ends up taking unnecessarily much of a CPU's time).

To avoid that slowdown, the CMCI kernel code switches to polling until the report rate is low again. (see log fragment above)

In itself, a CMCI storm isn't a problem.

However, the cause of that many machine check messages often is, and is often a sign of a problematic/broken memory stick, or sometimes bad BIOS-level config of RAM.

If the last thing in your logs before a freeze/panic is a CMCI storm message, that's probably due to uncorrectable memory errors, which the OS logs may not mention because the system chose to freeze. If your BIOS keeps EDAC logs, that will be a more reliable source.

If you suspect memory issues, you could try to run memtest86 or similar.

Or, on the next bootup, watch a terminal doing
dmesg -w
and try to reproduce, maybe by running a user-space memtester (not quite as thorough). If a CMCI storm message is the last thing you see in the logs before a freeze or panic, your RAM is suspect.

tools for fairly instantaneous statistics

standard tools

Some of these utilities are fairly standard to most unices, some of them report information specifically from recent linux kernels, and some OSes have better utilities than these


top is a basic overview of CPU, memory, swap, and process statistics. It's a little verbose, and not everything is very important.

Example output:

top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

11192 root      30   5 59136  55m R 43.6  5.5   0:03.30 cc1plus 
11199 root      26   5 12896 9244 R 24.8  0.9   0:00.25 cc1plus 
11193 root      22   5  4972 3332 S  1.0  0.3   0:00.02 i686-pc-linux-gnu-g++
11197 root      15   0  2100 1140 R  1.0  0.1   0:00.07 top
11198 root      24   5  2144  884 S  1.0  0.1   0:00.01 i686-pc-linux-gnu-g++
    1 root      15   0  1496  432 S  0.0  0.0   0:07.40 init [3]
    2 root      34  19     0    0 R  0.0  0.0   0:06.24 [ksoftirqd/0]
...and so on

Process lines

The process lines (the lines after PID USER PR etc.) are a bunch of common information.

A few notes:

  • columns are configurable. I like to simplify things.
  • you can kill and renice from within top (keys: k, r)
  • you can do more from tools like htop


  • %CPU is calculated by top, the percentage of active running time in the last interval
    • note: top samples /proc regularly, so short-lived processes may not be counted at all (atop can be better)
    • it also doesn't show when a process is stalled or spinlocking, versus doing useful work
  • PR and NI:
    • Higher PRiority numbers mean lower actual priority
    • the value is the basic PRiority (often 20) plus the process's NIceness, which lets users say "I'm in no hurry, other processes can get more CPU time if they want it"
  • TIME+ is cumulative CPU time spent.

  • Process state is
    • usually one of
      • S (sleeping)
      • R (runnable)
      • D (uninterruptible sleep)
    • others are
      • Z (exited, but parent has not cleaned it up yet)
      • T (stopped (as in paused), by job control (SIGSTOP/SIGCONT signals), or something like ptrace)
    • Much of S/R/D is about resources other than CPU
      • runnable means resources are there, and it's scheduled to be run - so either running right now, or soon
      • running means it is scheduled
      • uninterruptible sleep (D) is waiting for a resource, often device IO within the kernel itself
      • interruptible sleep (S) can be done voluntarily, or by the kernel, and often means waiting on an event


  • VIRT refers to mapped memory, all memory it could address without error. This includes shared memory, libraries, mmaps, and memory that was reserved but never actually used (promised by allocation, but never backed because it was never used). If there's a lot of the last, VIRT may mean very little about real memory use.
  • RES is how much of a process is RESident in RAM. This is a good indication of how much it uses at all, except when your system is thrashing: swapped-out memory does not count towards this.
  • SWAP (not there by default; press 'f', and 'p' in that screen): Amount of memory swapped out. VIRT is often roughly RES+SWAP because most other things are relatively small(verify)
  • %MEM is the percentage of physical memory the task uses (not sure what exactly counts towards it(verify))
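
The same figures can be read per process from /proc/PID/status, where VmSize, VmRSS and VmSwap roughly correspond to top's VIRT, RES and SWAP. A minimal sketch, assuming a reasonably recent kernel (older ones lack VmSwap, and kernel threads have no Vm lines at all); mem_fields is a made-up name:

```shell
# mem_fields : read /proc/PID/status text on stdin and print the
# virtual size, resident size and swapped-out size lines.
mem_fields() {
  awk '$1 == "VmSize:" || $1 == "VmRSS:" || $1 == "VmSwap:" { print $1, $2, $3 }'
}

# On a live system, for the shell itself:
#   mem_fields < /proc/$$/status
```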


The header can be seen as roughly three parts: 'CPU and IO status', 'memory status' and 'swap status'.

Overall CPU and IO

top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

The CPU usage division is probably most interesting:

  • user cpu time is programs using CPU time in a nice and scheduled way (and have a niceness of ≤0)
  • system time is the kernel doing things programs asked it to. This has some priority.
  • nice cpu time are programs spending CPU time, but have a niceness of >0 (those that will back off to let processes with lower niceness use more CPU time)
  • wait time typically means the system is waiting for IO. If this happens, the system is often either doing hard IO work, or swapping like crazy. Some wait time is implicit in IO work; a lot of it is bad in that it indicates an IO bottleneck. If it's because of heavy swapping, it is often avoidable.
  • hi and si: 'hard interrupts' and 'soft interrupts'. They represent driver time, networking and a few other things. They are rarely higher than a few percent.
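
These percentages come from the counters on the first line of /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal, in that order), so you can compute them yourself from two samples. A sketch for the wait figure; iowait_pct is a hypothetical helper:

```shell
# iowait_pct "FIRST" "SECOND" : given two "cpu ..." lines sampled from
# /proc/stat some seconds apart, print the percentage of time in iowait.
iowait_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user .. steal
      t[NR] = total; w[NR] = $6 }                       # field 6 is iowait
    END { d = t[2] - t[1]
          if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
          else print "0.0" }'
}

# On a live system:
#   a=$(head -n 1 /proc/stat); sleep 2; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```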

Zombie processes are not important, unless there are many. They are processes that are finished, are not using resources anymore but have not yet been cleaned up by the process that started them.

Overall memory

Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

Total is the user-usable RAM: the amount of physical memory minus a bit of kernel memory.

Used is physical RAM used by applications, basically resident + cache + buffers.

Free means "going to waste", not being used by anything. It will generally only be high right after bootup, or right after a memory-hogging process has stopped and the OS cache hasn't yet seen anything to use that space for.

Buffers and Cached both have a slightly complicated history, but roughly, buffers is the kernel stuff that needs to be there, and cached is things that can often be evicted.

  • The disk writeback cache (yay confusing terminology) - counts towards buffers
    • usually low, so you can completely ignore this unless it isn't low.
  • ext[34] journal, dentry, and other metadata (similar for jbd, ocfs2) - count towards cached
    • also usually small
  • the OS page cache - counts towards cached
    • things that the system thinks you may need again soon, often mostly filesystem metadata and data. It's helpful because it's faster than disk, and avoids IO.
    • you can consider this memory available in the sense that the kernel can almost immediately throw it away when a program asks for actual memory
    • but it is not counted towards 'free', because it isn't. Over time, you can expect any host to have a whole bunch in this cache.
  • mmap()ped files seem to count towards cached. This can sometimes be both a lot and fairly constant, such as with large mmapped logs.
  • some other near-kernel stuff
    • e.g. my SUnreclaim, which is slab, was GBytes large, which slabtop (a more formatted version of /proc/slabinfo) revealed was mostly ZFS, specifically its ARC

What I want instead of 'Free' is Free+Buffers+Cached, which is basically the 'usable RAM' figure. You can eyeball it from top's figures, or let free tell you basically this figure. The most interesting figure in its output is the free column of the -/+ buffers/cache row (315524 in the example below):
             total       used       free     shared    buffers     cached
Mem:        773500     760236      13264          0      36976     265284
-/+ buffers/cache:     457976     315524
Swap:      5992224     689324    5302900
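
The same sum can be taken straight from /proc/meminfo. A minimal sketch (usable_kb is a made-up name); note that newer kernels (3.14 and later) also export their own estimate as MemAvailable, which is usually the better number when it is there.

```shell
# usable_kb : read /proc/meminfo text on stdin, print MemFree+Buffers+Cached (kB)
usable_kb() {
  awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
       END { print sum + 0 }'
}

# On a live system:
#   usable_kb < /proc/meminfo
```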

Overall swap

Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

Swap total reflects the collective size of your enabled swap partitions. Usage is just that, and there is always some use, usually things that have not been used, such as allocated memory that has never been accessed, or parts of large executables that have never been used (It only makes sense to swap this out, as it gives more memory to active processes). (See also swappiness)

When the used swap is high, you have an active memory hog, too little memory -- or sometimes a program that likes to allocate a lot of memory without using it (linux considers this swapped, because it wants to guarantee it can back that allocation, but doesn't want to use RAM for this).

You can approximate the difference by seeing whether the swap figures change all the time. A better indicator is something like vmstat: if it reports si and so ('swapped in', 'swapped out') as 0 most of the time, the size you see in swap is probably not actively used.
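
A sketch of that vmstat check: count how many samples show any swap-in or swap-out. The columns are located via vmstat's header line; swapping_samples is a hypothetical name.

```shell
# swapping_samples : read `vmstat N [COUNT]` output on stdin and print
# "<samples with si or so above 0> of <samples seen>".
swapping_samples() {
  awk '
    $1 == "r" && $2 == "b" {                   # column header line
      for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
      next
    }
    si && so && $1 ~ /^[0-9]+$/ { n++; if ($si + $so > 0) busy++ }
    END { print busy + 0, "of", n + 0 }'
}

# On a live system, ten one-second samples:
#   vmstat 1 10 | swapping_samples
```

The first line vmstat prints is an average since boot, so on a long-running host it says little about right now.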

Continuous swapping will make your computer sluggish. If this isn't a rare occurrence, it may be wise to invest in a little more memory, or look whether you have memory hogs that can be configured to color within the lines a bit better.


Yes, there's a perl script. I think this one's just as simple:

# will implicitly use the current user's top settings (particularly the columns)
[ -z "$1" ] && exit  #quit if there were no arguments
top -d 1 -b -n 1 | sed -n -e '/PID/p' -e "/$1/p" | grep -v "\/$1\/" | grep -v topgrep | grep -v 'grep -v'


ps can be considered a non-interactive equivalent to top.

ps is convenient for some quick checks like:

ps faux | grep smbd    # is service running?
ps faux | grep ssh     # show connected sessions

and also for scripting, as you can control the output format. For example, consider:

# script to renice all of a user's processes. You'll often need root rights.
TARGET_USER=${1:?Missing username}   # not called USER, to avoid clobbering that environment variable
NICENESS=${2:-10}   # defaults to 10; ${var:-value} is the shell syntax for default-if-missing
renice "$NICENESS" -p $(ps --no-headers -U "$TARGET_USER" -o pid)

# continuously list processes in uninterruptible sleep (see what hits the disk)
# (note: D+ means foreground)
while true; do
  ps -eo stat,user,comm,pid | grep '^D'
  sleep 0.1
done

# script to summarize users that use CPU and memory
ps --no-headers -eo user,%cpu,%mem | \
  awk '{usercpu[$1]+=$2; usermem[$1]+=$3}
       END { for (u in usercpu) { if (usercpu[u]>5 || usermem[u]>5)
             printf("%15s using  %4d%% CPU  and %4d%% resident memory\n",
                        u, usercpu[u], usermem[u]) }  }'

# example output:
#       postgres using     0% CPU  and    9% resident memory
#       www-data using     5% CPU  and   30% resident memory
#           root using     9% CPU  and    3% resident memory

Selecting and stopping: pidof, kill, killall

When you don't have a GUI or shell way of killing a program, you'll have to use slightly harsher means when you want to kill a program. The old fashioned way is to get its process id with pidof and then use kill or, failing that, kill -9.

The difference is that the former defaults to the TERM signal (15), which the process receives and can handle by shutting down nicely. (Signals are used for more than termination; in particular, HUP (hangup, 1) has been used as a 'reload configuration files' signal.) Signal 9, KILL, cannot be trapped and instructs the kernel to kill the process somewhat harshly. It is the surest way to kill, but gives the process no chance to clean up in terms of child processes, IO, and such ((verify) which), so should only be used if the default TERM didn't work.
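
The 'TERM first, KILL only if that didn't work' habit is easy to wrap up. A minimal sketch in plain sh; stop_gently and its grace period argument are made up for this illustration.

```shell
# stop_gently PID [GRACE_SECONDS] : send TERM, wait up to GRACE_SECONDS
# (default 5) for the process to go away, then fall back to KILL.
stop_gently() {
  sg_pid=$1
  sg_grace=${2:-5}
  # If TERM cannot be sent, the process is already gone (or is not ours).
  kill -TERM "$sg_pid" 2>/dev/null || return 0
  sg_i=0
  while [ "$sg_i" -lt "$sg_grace" ]; do
    kill -0 "$sg_pid" 2>/dev/null || return 0    # it exited after TERM
    sleep 1
    sg_i=$((sg_i + 1))
  done
  kill -KILL "$sg_pid" 2>/dev/null               # last resort
  return 0
}

# Example:
#   stop_gently "$(pidof -s firefox-bin)" 10
```

A process stuck in uninterruptible sleep will not react to either signal until the call it is blocked on returns.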

Kill takes a process id, which you can get from top, ps and others, or more directly via pidof. Using killall enables you to use a process name instead of a PID. Summarizing:

# pidof firefox-bin
8222 8209 8208 8204
# kill 8222 8209 8208 8204
# kill `pidof firefox-bin`
# killall firefox-bin

You do have to match the actual name. For example, if firefox is a script that runs an actual executable called firefox-bin, then killall firefox doesn't always do you much good.

A regular user can only kill their own processes; root can kill anything (except some kernel processes).

Processes will not die while they are in IOwait. This usually doesn't matter, unless it is blocked on a single call for a very long time. You'll want to take away the thing they are blocked on before they'll die. This may not be a simple task.

nonstandard tools

As in, ones you'll have to install, but are probably useful.


Compared to plain top:

  • more compact ascii-arty summary of some things
  • searchable
  • slightly handier, e.g. you can select multiple processes and kill all in selection
  • slightly more useful columns
  • more configurable (if you like that kinda thing)



Sort of a variant of top that also summarizes

  • disk IO
  • network IO
  • disk mounts/use

...and colors some key ones on whether they seem to be unduly loaded.

glances tool

I personally like to type less than that. I imitate the psgrep perl script mentioned here and there with a much more basic, fragile, and close-to-what-I-generally-want:

[ -z "$1" ] && exit  #quit when there are no arguments
ps ax | sed -n -e '1 p' -e "/$1/p" | grep -v "\/$1\/" | grep -v psgrep | grep -v 'grep -v'

This will look for the text fragment you type (the first word, it ignores the rest), and give you ps's headers for reference.

The grepping is a hacky, hacky attempt at removing the matching grep and sed processes that are actually this psgrep script itself. The '1 p' unconditionally lets through ps's header.
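
A common alternative to the chain of grep -v is to bracket the first character of the search word: as a regex, [s]mbd still matches smbd, but the grep process's own command line now contains the literal text [s]mbd, which the pattern does not match. A small sketch; bracket_pattern is a made-up helper name.

```shell
# bracket_pattern WORD : print WORD with its first character bracketed,
# e.g. smbd -> [s]mbd
bracket_pattern() {
  bp_rest=${1#?}                                  # everything after the first character
  printf '[%s]%s\n' "${1%"$bp_rest"}" "$bp_rest"
}

# Example:
#   ps ax | grep -- "$(bracket_pattern smbd)"
```

If your system has pgrep (part of procps), that avoids the problem entirely, since it never matches itself.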

You can do something similar for top (can be handy e.g. for memory statistics), using its batch mode.

[ -z "$1" ] && exit  #quit if there were no arguments
top -b -n 1 | sed -n -e '/COMMAND/p' -e "/$1/p" | grep -v "\/$1\/" | grep -v topgrep | grep -v 'grep -v'

This too is quite hacky, and note the format top uses depends on its .toprc file. The header line is included by grepping for COMMAND, which I figured is the field most likely to be present.

Recorded statistics

See Network tools. The somewhat wider tools like cacti also store system statistics.



monit is a service that periodically checks a configurable set of properties of a system or process on it.

It makes it fairly easy to check

  • whether common services like websites or mail are up - at protocol level
  • practical things like "does the SSL certificate expire within X days"
  • resource use (check for high/unusual load on cpu, network, disk)
  • disk space check
  • network interface link, link speed
  • "if bad the last X checks" to avoid false-positive spam
  • file changes (uid/gid/permission/checksum)

Actions beyond alerts, e.g.

  • is process still running? If not, restart (e.g. used in docker)
  • execute, e.g. "if log is big, run logrotate", 'if lost IP, restart interface'

See also for a detailed overview

It is primarily set up to send alerts and show overall status, not so much for presenting pretty graphs.

There's a status export in XML (though not JSON in bare monit).

Monit is free open source software.

M/Monit is a paid-for wrapper that gives prettier graphs, and gives an overview for many hosts. The pricing is good for datacenters and such, though not for home users.

See also