Linux admin notes - health and statistics

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Reading (linux) system use and health

Load average

reported by top, uptime, and more (values typically from /proc/loadavg)


  • decent indication of sustained load on a node
  • not useful for changes on the term of seconds
  • not technically averages, more a lowpass filter on load.

"Load" is an estimate of how many processes are actively doing stuff, specifically:

  • using CPU, or
  • waiting to be scheduled onto a CPU, or
  • in uninterruptible sleep (often disk, sometimes network or other)

"average" - not actually. It's an exponentially damped figure - think lowpass filter. This is relevant in that the 1, 5, 15 figures are not at all "average over that many minutes" but rather indicate how quickly each figure adapts to the real load. Which is useful, just not an average.

When you see something like
load average: 1.69, 1.70, 1.73
you can guess that there are probably two sustained processes actively using the CPU (and likely sharing its speed), but since it's under 2.0, one or both are probably not active all the time.
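To get a feel for the damping described above, here is a minimal sketch that mimics it in awk. It assumes the kernel samples the number of active processes every 5 seconds and decays the previous value by exp(-5/window); the function name damped_load and its arguments are made up for illustration.

```shell
# Sketch: exponentially damped load figure, sampled every 5 seconds.
# usage: damped_load WINDOW_SECONDS ACTIVE_COUNT DURATION_SECONDS
# (hypothetical helper; the 5-second sampling interval is an assumption)
damped_load() {
  awk -v window="$1" -v active="$2" -v duration="$3" 'BEGIN {
    decay = exp(-5 / window)          # per-sample decay factor
    load = 0                          # start from an idle system
    for (t = 5; t <= duration; t += 5)
      load = load * decay + active * (1 - decay)
    printf "%.2f\n", load
  }'
}
```

For example, damped_load 60 1 60 prints 0.63: one minute after a single busy process starts on an idle system, the 1-minute figure has only covered about two thirds of the distance to 1.0. damped_load 900 1 60 prints 0.06, which is how little the 15-minute figure has moved in that same minute.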

If the number is high, you can usually assume there are many things fighting for CPU or disk. Keep in mind that if you have a many-core processor, more processes can run alongside each other perfectly fine

...though you can't tell whether they're happily working alongside, or throttling your disk system.
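Since the same load figure means different things on different core counts, it can help to divide one by the other. A minimal sketch, with load_per_core being a made-up helper name:

```shell
# usage: load_per_core LOAD CORES
# Prints the load divided by the number of cores (hypothetical helper).
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live system (assumes /proc/loadavg and nproc are available):
#   load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

A result well under 1.0 suggests there are cores to spare, though as noted it still can't tell you whether the load is CPU work or things stuck waiting on disk.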

When swapping, and particularly when thrashing, the load average may spike simply because many things are waiting while the kernel spends a lot of IO time swapping things in and out.

CPU use types

Wait time

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A process in uninterruptible sleep (shown as status 'D' in top and often in ps) is usually, more specifically, waiting on IO.

Iowait is the time a process spends sitting around waiting for the system to fulfill a blocking IO request.

Specifically, that means file IO (including mmapped files, and network mounts because they imply a file interface), but not e.g. sockets, IPC, or networking.

(More technically, possible process states include both interruptible sleep and uninterruptible sleep. Interruptible still allows signalling, uninterruptible sleep does not and is useful when certain signals should not be handled - which includes cases where a process is waiting on device IO.)

IOwait means the process does not get scheduled for CPU work - though does count towards the load average (verify).

Note that while iowait actually measures CPU scheduling time, the fact that IO is the reason a process doesn't get scheduled makes it a decent measure of disk slowness.

But far from a perfect one, for multiple reasons (TODO: more detail).

The most common cause for iowait is platter-disk operations, mostly because individually they are slower than most other causes, and when they apply, they apply constantly.

This wait for IO is as (un)avoidable as the disk access itself is. Processes doing little IO tend not to wait long, also because OSes put a buffer in between, which means the IO subsystem does the actual writing, and so also does the actual waiting. As such, this is only a problem when you do so much IO that the buffer no longer saves you, and you and/or other processes start spending most of their time waiting. (You can bypass this buffer, but for many uses you wouldn't want to)

When there are multiple programs using a disk, the average seek time can only get bigger. Random access also makes for longer waiting.

The wait may be longer than strictly necessary, in particular when actively swapping (or even thrashing), which can be improved by making programs not be hogs, and/or by sticking in more memory (...the latter can have a double improvement: less/no swapping, and memory to spare means more commonly-requested disk data stays cached in memory, meaning less actual disk IO for that data).

There are other sources of wait time: any device that handles a lot of data, or does a lot of small interactions.

Note also that you can have a healthy system with high iowait, and you can have an IO bottleneck with near-zero iowait.

A better measure of whether your disks are getting pushed hard is IOPS, and the time operations spent queued. iostat can help estimate both.

(Consider that a 7200RPM drive can do roughly 100 IOPS sequentially (5400RPM does 50-80, 10k does 120-150, 15k does 170-200). A single process doing random access will be somewhat lower, and many things fighting will be lower yet. SSDs are faster, but usually also vary more due to their internal management sometimes kicking in; see also TRIM and such.)

To figure out which disk(s), try iostat (if you don't have it, it's usually in a package called sysstat):

iostat -x 2

If there is one continuously busy drive, that's usually easy to read off even without knowing what most of that means.

This is also a better metric (than iowait) of disk utilisation - the reads and writes/second are basically the IOPS figure, which you can compare against expectation of your disk (or array)
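If iostat isn't installed, a rough IOPS figure can also be derived from two snapshots of /proc/diskstats. The sketch below assumes the usual layout of that file (field 3 is the device name, field 4 the completed reads, field 8 the completed writes); iops_between is a made-up helper name.

```shell
# usage: iops_between OLD_SNAPSHOT NEW_SNAPSHOT SECONDS
# Prints "device rate" for every device that completed IO between snapshots.
# (hypothetical helper; field positions are an assumption about /proc/diskstats)
iops_between() {
  awk -v secs="$3" '
    NR == FNR { ops[$3] = $4 + $8; next }      # first file: remember totals per device
    ($3 in ops) {
      rate = ($4 + $8 - ops[$3]) / secs        # completed reads+writes per second
      if (rate > 0) printf "%s %.1f\n", $3, rate
    }' "$1" "$2"
}

# Typical use:
#   cat /proc/diskstats > /tmp/ds.old; sleep 2; cat /proc/diskstats > /tmp/ds.new
#   iops_between /tmp/ds.old /tmp/ds.new 2
```

Compare the result against what you expect your disk (or array) to manage; partitions show up as separate devices next to the whole disk.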

Some things aren't much data but still a lot of work, such as the thousands of fstat()s by scanning a large directory tree. This is possibly more visible in the await column, which shows the average time (milliseconds) that disk operations stay queued (waiting, seeking, reading/writing and everything around it), particularly on platter: if it's ≤the drive's seek time then the drive probably does one thing at a time (=is keeping up with the requests). If it's usually more than the drive's seek time, then things are probably regularly waiting on seeks. Platter drives are often around ~7ms, so a dozen's still fine, while a hundred's starting to be a lot.

Past observing that you have IO wait time, you may want to find out what process and what device is so busy.

A start is to do something like:

while [ 1 ]; do (sleep .3; ps -lyfe | egrep '^D'); done

This requires some guesswork, though.

Partly fundamental - consider that if you have one process occupying a disk fully, it means all processes that do file IO to the same disk, no matter how small, will also be in D. Much shorter, yes, but they will show up.

Also you'll almost always see ignorable things like

  • pdflush (kernel process which buffers and flushes data to disk). In fact, until there's contention, this will be in D more than the process where the data comes from. Offloading iowait is sort of the point of pdflush.
  • kworker (kernel process that handles interrupts, timers, IO)
  • filesystem-supporting processes like jbd2 (for ext[34]), txg_sync for zfs, and so on
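Because a single look at the D-state list involves guesswork, it can help to sample for a while and count how often each command shows up. A minimal sketch; tally_d_state is a made-up name, and it only looks at the first word of the command name.

```shell
# Reads "STAT COMMAND" lines on stdin and prints how often each command
# was seen in a D state, most frequent first (hypothetical helper).
tally_d_state() {
  awk '$1 ~ /^D/ { seen[$2]++ } END { for (c in seen) print seen[c], c }' | sort -rn
}

# Sampling for roughly six seconds:
#   for i in $(seq 20); do ps -e -o stat= -o comm=; sleep 0.3; done | tally_d_state
```

Whatever ends up on top, after mentally removing the ignorable kernel helpers listed above, is a reasonable first suspect.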

You can try to figure out the kind of work it's doing so hard, e.g. by using strace (probably -c for time summary) to figure out which syscalls it's spending most time in. There are helper scripts for this, like

Other system / kernel details

vmstat will give you an overall view of what the kernel is up to, including memory usage, disk blocks use, swap activity, context switches, and more. For an example, for sums over a 3-second interval:

vmstat 3

There are some other reports it can give, see its man page.

A related, slightly nicer looking third party app you may want to look at is dstat.

IO and filesystem usage

(keep in mind you may need to run these as superuser to be particularly informative)

Specific files

fuser lists PIDs that have a particular resource open (reading, writing, executing, mmapped; these and more are appended as a letter), e.g.

  • "figure out which filesystem this path is on, then list all processes which have files open on that filesystem". Useful to see what prevents a umount
fuser -m -v /mnt/data4
You can ask fuser to kill the implied processes.
fuser -mik /mnt/data4
the -i, for interactive, is used to avoid accidentally killing way too much by accidentally implying the root filesystem.
...or you could use kill/killall manually.

Note that if you get:

                  USER        PID ACCESS COMMAND
/path:            root     kernel mount /path

...this probably means you've exported it via NFS. In many cases, stopping the kernel service is what you want here.

  • "check for a whole bunch of specific files", e.g. stuff that processes have open in /tmp
find /tmp | xargs fuser -v    # or lazier:
fuser -v /tmp/* /tmp/*/*

  • "Which has this directory open", e.g. your homedir
fuser -v ~

  • "who has TCP port 80, and in a human-readable form, please" (keep in mind you probably get no results without a sudo)
fuser -v -n tcp 80

lsof lists open files. Because of unix's "everything is files" philosophy, this includes sockets, directories, FIFOs, memory-mapped files, and more. It can be used to inspect program (mis)behaviour and such:

  • lsof /data4
    rather similar to fuser -m
  • lsof -u samba
    lists open files for user samba (something fuser cannot do)
  • lsof -c bash
    lists everything related to running bash processes
  • lsof -p 18817
    lists all things opened by a certain process
  • lsof -i
    "Alright, what's networking up to?" Netstat is probably more interesting for this, but looking by port (lsof -i :22) and host (lsof -i@) is easy enough (see the man page for more details).
  • watch -n 0.1 "lsof -n -- /data /data2 | grep smbd | egrep -i '\b(DIR|REG)\b'"
    : "keep tracking the files and directories that samba keeps open under the /data and /data2 directories"
  • or just a summary of which programs use how many handles:
    lsof | cut -f 1 -d ' ' | sort | uniq -c | sort -n

(Note that different *nix-style systems have different options on lsof)

IO summaries

vmstat gives a summary about processes, memory, swapping, block IO, interrupts, context switches, CPU and more. Good to inspect how a taxed system is being taxed.

For example,
vmstat 2
: show averages every two seconds

It can also show certain fine grained statistics, given kernel support (see the man page).

Things like iotop, and features within atop and htop can be used to show IO speed of processes, and/or totals.

On iotop:

  • needs to be run as root (or have NET_ADMIN capability), which can be impractical
  • iotop -o
    shows just processes with non-zero IO
  • iotop -a
    show cumulative amounts
  • OSError: Netlink error: Invalid argument (22) basically means your kernel doesn't have the support(verify). If you're on centos or rhel, this means 5.6 or later.(verify)


  • ifconfig to see (or configure) the network interfaces
  • netstat will list various things about networking and can show e.g.
    • open connections (no parameter)
    • listens and open connections (-a)
    • udp and/or tcp (-u, -t) since you often don't care about all the unix sockets
    • routing table (-r) (see also route)
    • interface summary (-i)
    • statistics (-s)

I use -pnaut (programname, noresolve, listen+connections, udp, tcp).

  • ss is similar to netstat

  • arp (arp -n to avoid resolves) to see the ARP table
  • route (route -n to avoid resolves) to see the routing table
  • iptables to change the IP filtering/nat/mangling tables (see also iptables). Possibly interesting to you are:
    • iptables-save, which produces file-saveable text (and is also handy to see all of the iptables state), and
    • iptables-restore, which reinstates a file saved through iptables-save.

  • iwconfig to see (or configure) the wireless network interfaces
    • (Other general wireless tools: iwevent, iwspy, iwlist, iwpriv)
    • (Other specific wireless tools: wlanconfig, etc.)

Kernel, drivers

  • lsmod lists currently loaded kernel modules (see also modprobe, insmod, rmmod)
  • lspci lists PCI devices. Using -v is a little more informative. (see also setpci)
  • lsusb lists USB busses and devices on them

Drives and space

df tells you what storage you can get at, and how much space is left on each.

In contrast:

  • /etc/mtab lists things that are mounted, more completely than df does, because df reports only things meant for storage, which excludes things like proc, udev/devfs, usbfs, and whatnot.
  • To see an exhaustive list of things that the system knows could be mounted, see /etc/fstab (see also fstab).
  • To see swap partition/file use,
    cat /proc/swaps
    will do, which is basically what swapon -s does.
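For a one-line summary of swap use, the table in /proc/swaps can be totalled with awk. This sketch assumes the usual column order (Filename, Type, Size, Used, Priority, with sizes in KiB) and swap paths without spaces; swap_summary is a made-up name.

```shell
# Reads /proc/swaps-style text on stdin and prints total swap use.
# (hypothetical helper; column order is an assumption)
swap_summary() {
  awk 'NR > 1 { size += $3; used += $4 }       # skip the header line
       END {
         if (size > 0) printf "%d of %d KiB used (%.0f%%)\n", used, size, 100 * used / size
         else print "no swap enabled"
       }'
}

# On a live system:
#   swap_summary < /proc/swaps
```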

df notes:

  • The -h option is useful to see human-readable sizes.
  • df -B MiB (or MB) makes df report everything in megabytes, which can be useful when you're watching for differences on the order of megabytes per second (e.g. watch -d df -B MiB)
  • On ext2, ext3, and ext4, the figures do not add up exactly, because 5% of the space is reserved by default (short story: this is a good thing for general use, though in WORM situations it can make sense to set it to 0%).

To see where the big stuff is in the directory tree, use du, detailed elsewhere. (There are also fancier graphical programs for this, such as baobab, that give better visual overview)

RAM health

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



EDAC ("Error Detection and Correction") reports (among other things) errors in ECC RAM, and PCI bus transfers, where applicable.

If you see errors in your logs, the main thing to check is whether they mention

CE (Corrected Errors)
UE (Uncorrectable Errors)

An occasional CE is to be expected (that's not bad ECC, that's just you noticing the occasional bit flips that regular-RAM people don't notice).

For example:

Sep  3 07:31:36 hostname kernel: [575877.525766] CMCI storm detected: switching to poll mode
Sep  3 07:31:37 hostname kernel: [575878.258800] EDAC MC0: 1900 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:38 hostname kernel: [575879.259555] EDAC MC0: 1610 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:31:39 hostname kernel: [575880.260323] EDAC MC0: 2560 CE error on CPU#0Channel#1_DIMM#0
   (channel:1 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Sep  3 07:32:06 hostname kernel: [575907.545221] CMCI storm subsided: switching to interrupt mode

See also:!topic/fa.linux.kernel/3RV3y4Y2WT8%5B1-25-false%5D

CMCI storms

CMCI refers to the way EDAC messages are communicated ((Intel's) Corrected Machine Check Interrupt).

It typically uses interrupts, but when errors are reported constantly this leads to an interrupt storm and to serious slowdown, so the kernel switches to polling until the report rate is low again.

Note that you still don't want this, as it tends to indicate a problematic/broken memory stick (or sometimes a bad RAM hardware configuration).

It may be that your system freezes after this due to uncorrectable memory errors. If you suspect that, on the next bootup watch a terminal running dmesg -w (and maybe run memtester); if a CMCI storm is the last thing you see, that is probably what happened.

tools for fairly instantaneous statistics

standard tools

Some of these utilities are fairly standard to most unices, some of them report information specifically from recent linux kernels, and some OSes have better utilities than these


top is a basic overview of CPU, memory, swap, and process statistics. It's a little verbose, and not everything is very important.

Example output:

top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

11192 root      30   5 59136  55m R 43.6  5.5   0:03.30 cc1plus 
11199 root      26   5 12896 9244 R 24.8  0.9   0:00.25 cc1plus 
11193 root      22   5  4972 3332 S  1.0  0.3   0:00.02 i686-pc-linux-gnu-g++
11197 root      15   0  2100 1140 R  1.0  0.1   0:00.07 top
11198 root      24   5  2144  884 S  1.0  0.1   0:00.01 i686-pc-linux-gnu-g++
    1 root      15   0  1496  432 S  0.0  0.0   0:07.40 init [3]
    2 root      34  19     0    0 R  0.0  0.0   0:06.24 [ksoftirqd/0]
...and so on

Process lines

The process lines (the lines after PID USER PR etc.) are a bunch of common information. A few notes:

  • columns are configurable. I usually simplify things.
  • you can kill and renice from within top (keys: k, r)


  • %CPU is calculated by top, the percentage of active running time in the last interval
note: samples /proc regularly, so short-lived processes may not be counted at all (atop can be better)
also doesn't show when it's stalled or spinlocking, versus doing useful work
  • PR and NI:
    • Higher PRiority numbers mean lower actual priority
    • the value is the basic PRiority (often 20) plus the process's NIceness, which lets users say "I'm in no hurry, other processes can get more CPU time if they want it"
  • Time+ is cumulative CPU time spent.

  • Process state is usually one of
    • S (sleeping)
    • R (runnable)
    • D (uninterruptible sleep)
    others are
    • Z (exited, but parent has not cleaned it up yet)
    • T (stopped (as in paused), by job control (SIGSTOP/SIGCONT signals), or something like ptrace)
    Much of S/R/D is about resources other than CPU:
    • runnable means resources are there, and it's scheduled to be run - so either running right now, or soon
    • running means it is scheduled
    • uninterruptible sleep (D) is waiting for a resource, often device IO within the kernel itself
    • interruptible sleep (S) can be entered voluntarily, or by the kernel, and often means waiting on an event


  • VIRT refers to mapped memory, all memory it could address without error. This includes shared memory, libraries, mmaps, and memory that was reserved but never actually used (promised by allocation, but never backed because it was never used). If there's a lot of the last, VIRT may mean very little about real memory use.
  • RES is how much of a process is RESident in RAM. This is a good indication of how much it uses at all, except when your system is thrashing: swapped-out memory does not count towards this.
  • SWAP (not there by default; press 'f', and 'p' in that screen): Amount of memory swapped out. VIRT is often roughly RES+SWAP because most other things are relatively small(verify)
  • %MEM is the percentage of physical memory the task uses (not sure what exactly counts towards it(verify))


That header can be seen as roughly three parts: 'CPU and IO status', 'memory status' and 'swap status':

Overall CPU and IO
top - 21:18:31 up 18 days,  1:16, 13 users,  load average: 2.61, 2.47, 2.01
Tasks: 124 total,   4 running, 120 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.0%sy, 97.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

The CPU usage division is probably most interesting:

  • user cpu time is programs using CPU time in a nice and scheduled way (and have a niceness of ≤0)
  • system time is the kernel doing things programs asked it to. This has some priority.
  • nice cpu time are programs spending CPU time, but have a niceness of >0 (those that will back off to let processes with lower niceness use more CPU time)
  • wait time typically means the system is waiting for IO. If this happens the system is often either doing hard IO work, or swapping like crazy. Some wait time is implicit in IO work; a lot of it is bad in that it indicates an IO bottleneck. If it's because of heavy swapping it is often avoidable.
  • hi and si: 'hard interrupts' and 'soft interrupts'. They represent driver time, networking and a few other things. They are rarely higher than a few percent.
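The same division can be computed by hand from two samples of the first line of /proc/stat, which is roughly what top does each interval. This sketch assumes the usual field order after the "cpu" label (user, nice, system, idle, iowait, irq, softirq, steal); cpu_split is a made-up name.

```shell
# usage: cpu_split "OLD_CPU_LINE" "NEW_CPU_LINE"
# Prints the percentage of time spent in each category between two samples.
# (hypothetical helper; field order is an assumption about /proc/stat)
cpu_split() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) old[i] = $i; next }
    { for (i = 2; i <= 9; i++) { d[i] = $i - old[i]; total += d[i] }
      if (total == 0) exit                     # no time passed between samples
      printf "us %.1f ni %.1f sy %.1f id %.1f wa %.1f hi %.1f si %.1f st %.1f\n",
        100*d[2]/total, 100*d[3]/total, 100*d[4]/total, 100*d[5]/total,
        100*d[6]/total, 100*d[7]/total, 100*d[8]/total, 100*d[9]/total }'
}

# Typical use:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat); cpu_split "$a" "$b"
```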

Zombie processes are not important, unless there are many. They are processes that are finished, are not using resources anymore but have not yet been cleaned up by the process that started them.

Overall memory
Mem:    773500k total,   760236k used,    13264k free,    36976k buffers
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

Total is the user-usable RAM: the amount of physical memory minus a bit of kernel memory.

Used is physical RAM used by applications, basically resident + cache + buffers.

Free means "going to waste", not being used by anything. It will generally only be high right after bootup, or right after a memory-hogging process has stopped and the OS cache hasn't yet seen anything to use that space for.

Buffers is primarily the disk 'write cache' (yay confusing terminology). Write caches are more a matter of implication and necessity than read caches are, and this figure is usually low, so you can completely ignore it unless it isn't low.

Cached is a few things, but typically primarily the OS disk cache: things that the system thinks you may need again soon, often mostly filesystem metadata and data. It's helpful because it's faster than disk, and avoids IO. Memory used for this cache is counted as used rather than free, which is why 'free' is not very useful - cached data will move out of the way for allocation very fast, so almost all memory in cache is effectively usable memory.

mmap()ped files seem to count towards cached. This can sometimes be both a lot and fairly constant, such as with large mmapped logs.

What I want instead of 'Free' is Free+Cached (plus buffers), which is basically the 'usable RAM' figure. You can eyeball it from top's figures, or let free tell you basically this figure. The most interesting figure in its output is the free column of the '-/+ buffers/cache' line (315524 below):
             total       used       free     shared    buffers     cached
Mem:        773500     760236      13264          0      36976     265284
-/+ buffers/cache:     457976     315524
Swap:      5992224     689324    5302900
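The same usable-memory sum can be taken straight from /proc/meminfo. A minimal sketch, assuming the MemFree, Buffers and Cached lines are present and in kB; usable_memory is a made-up name.

```shell
# Reads /proc/meminfo-style text on stdin and prints free + buffers + cached.
# (hypothetical helper; field names are an assumption about /proc/meminfo)
usable_memory() {
  awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
       END { print sum, "kB" }'
}

# On a live system:
#   usable_memory < /proc/meminfo
```

Fed the figures from the example output, this gives 315524 kB, the same number free reports. Newer kernels also expose a MemAvailable line in the same file, which is the kernel's own estimate of much the same thing.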

Overall swap
Swap:  5992224k total,   689324k used,  5302900k free,   265284k cached

Swap total reflects the collective size of your enabled swap partitions. Usage is just that, and there is always some use, usually things that have not been used, such as allocated memory that has never been accessed, or parts of large executables that have never been used (It only makes sense to swap this out, as it gives more memory to active processes). (See also swappiness)

When the used swap is high, you have an active memory hog, too little memory -- or sometimes a program that likes to allocate a lot of memory without using it (linux considers this swapped, because it wants to guarantee it can back that allocation, but doesn't want to use RAM for this).

You can approximate the difference by seeing whether the swap figures change all the time. A better indicator is something like vmstat: if it reports si and so ('swapped in', 'swapped out') as 0 most of the time, the size you see in swap is probably not actively used.
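The si/so check can be automated by counting how many vmstat samples show any swap traffic. This sketch finds the si and so columns by name from vmstat's header line rather than assuming fixed positions; swap_activity is a made-up name.

```shell
# Reads vmstat output on stdin and reports how many samples had si or so above 0.
# (hypothetical helper; relies on the header line that starts with "r")
swap_activity() {
  awk '
    $1 == "r" {                                  # header line: locate the columns
      for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
      next
    }
    si && $1 ~ /^[0-9]+$/ { total++; if ($si + $so > 0) busy++ }
    END { printf "%d of %d samples show swap traffic\n", busy, total }'
}

# Typical use (ten one-second samples):
#   vmstat 1 10 | swap_activity
```

Keep in mind that the first line vmstat prints is an average since boot, so it can count as "busy" even on a system that is quiet right now.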

Continuous swapping will make your computer sluggish. If this isn't a rare occurrence, it may be wise to invest in a little more memory, or look whether you have memory hogs that can be configured to color within the lines a bit better.


Yes, there's a perl script. I think this one's just as simple:

# will implicitly use the current user's top settings (particularly the columns)
[ -z "$1" ] && exit  #quit if there were no arguments
top -d 1 -b -n 1 | sed -n -e '/PID/p' -e "/$1/p" | grep -v "\/$1\/" | grep -v topgrep | grep -v 'grep -v'


ps can be considered a non-interactive equivalent to top.

ps is convenient for some quick checks like:

ps faux | grep smbd    # is service running?
ps faux | grep ssh     # show connected sessions

and also for scripting, as you can control the output format. For example, consider:

# script to renice all of a user's processes. You'll often need root rights.
# (uses TARGET_USER rather than USER, to avoid clobbering the shell's own USER variable)
TARGET_USER=${1:?Missing username}
NICENESS=${2:-10}   # defaults to 10; :- is the shell syntax for default-if-missing
renice "$NICENESS" -p $(ps --no-headers -U "$TARGET_USER" -o pid)

# continuously list processes in uninterruptible sleep (see what hits the disk)
# (note: D+ means foreground)
while true; do
  ps -eo stat,user,comm,pid | grep '^D'
  sleep 0.1
done

# script to summarize users that use CPU and memory
ps --no-headers -eo user,%cpu,%mem | \
  awk '{usercpu[$1]+=$2; usermem[$1]+=$3}
       END { for (u in usercpu) { if (usercpu[u]>5 || usermem[u]>5)
             printf("%15s using  %4d%% CPU  and %4d%% resident memory\n",
                        u, usercpu[u], usermem[u]) }  }'

Example output:

      postgres using     0% CPU  and    9% resident memory
      www-data using     5% CPU  and   30% resident memory
          root using     9% CPU  and    3% resident memory

Selecting and stopping: pidof, kill, killall

When a program offers no GUI or shell way of quitting it, you'll have to use slightly harsher means to stop it. The old fashioned way is to get its process id with pidof and then use kill or, failing that, kill -9.

The difference is that the former defaults to the TERM signal (15), which the process receives and can handle by shutting down nicely. (Signals are used for more than termination; in particular, HUP (hangup, 1) has been used as a 'reload configuration files' signal.) The just-mentioned signal 9, KILL, is untrappable and instructs the kernel to kill the process somewhat harshly. It is the surest way to kill, but means the process gets no chance to clean up (child processes, IO, and such ((verify) which)), so it should only be used if the default TERM didn't work.

Kill takes a process id, which you can get from top, ps and others, or more directly via pidof. Using killall enables you to use a process name instead of a PID. Summarizing:

# pidof firefox-bin
8222 8209 8208 8204
# kill 8222 8209 8208 8204
# kill `pidof firefox-bin`
# killall firefox-bin

With killall, you do have to match the actual name. For example, if firefox is a script that runs an actual executable called firefox-bin, then killall firefox won't do you much good.

As a regular user you can only kill your own processes, as root you can kill anything but some system processes.

Processes will not die while they are in IOwait. This usually doesn't matter, unless it is blocked on a single call for a very long time. You'll want to take away the thing they are blocked on before they'll die. This may not be a simple task.
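The "TERM first, KILL only if that didn't work" approach can be wrapped in a small function. A minimal sketch; stop_gently and its default timeout of 5 seconds are made up for illustration.

```shell
# usage: stop_gently PID [TIMEOUT_SECONDS]
# Sends TERM, waits up to TIMEOUT seconds for the process to exit, then sends KILL.
# (hypothetical helper)
stop_gently() {
  target_pid=$1
  timeout=${2:-5}
  kill -TERM "$target_pid" 2>/dev/null || return 0   # already gone (or not ours)
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$target_pid" 2>/dev/null || return 0    # PID no longer exists
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -KILL "$target_pid" 2>/dev/null || true        # last resort
}
```

Two caveats: a process that has exited but has not been reaped by its parent (a zombie) still answers to kill -0, so the function may wait out its full timeout for it, and as noted above even KILL will not take effect while the process is stuck in uninterruptible sleep.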

nonstandard tools

As in, ones you'll have to install, but are probably useful.


htop

Compared to plain top:

  • a more compact, ascii-arty summary of some things
  • searchable
  • slightly handier, e.g. you can select multiple processes and kill all in selection
  • slightly more useful columns
  • more configurable (if you like that kinda thing)



Sort of a variant of top that also summarizes

  • disk IO
  • network IO
  • disk mounts/use

...and colors some key ones on whether they seem to be unduly loaded.

glances tool

I personally like to type less than that. I imitate the psgrep perl script mentioned here and there with a much more basic, fragile, and close-to-what-I-generally-want:

[ -z "$1" ] && exit  #quit when there are no arguments
ps ax | sed -n -e '1 p' -e "/$1/p" | grep -v "\/$1\/" | grep -v psgrep | grep -v 'grep -v'

This will look for the text fragment you type (the first word, it ignores the rest), and give you ps's headers for reference.

The grepping is a hacky, hacky attempt at removing the matching grep and sed processes that are actually this psgrep script itself. The '1 p' unconditionally lets through ps's header.

You can do something similar for top (can be handy e.g. for memory statistics), using its batch mode.

[ -z "$1" ] && exit  #quit if there were no arguments
top -b -n 1 | sed -n -e '/COMMAND/p' -e "/$1/p" | grep -v "\/$1\/" | grep -v topgrep | grep -v 'grep -v'

This too is quite hacky, and note the format top uses depends on its .toprc file. The header line is included by grepping for COMMAND, which I figured is the field most likely to be present.

Recorded statistics

See Network tools. The somewhat wider tools like cacti also store system statistics.



monit is a service that periodically checks a configurable set of properties of a system or process on it.

It makes it fairly easy to check

  • whether common services like websites or mail are up - at protocol level
  • practical things like "does the SSL certificate expire within X days"
  • resource use (check for high/unusual load on cpu, network, disk)
  • disk space check
  • network interface link, link speed
  • "if bad the last X checks" to avoid false-positive spam
  • file changes (uid/gid/permission/checksum)

Actions beyond alerts, e.g.

  • is process still running? If not, restart (e.g. used in docker)
  • execute, e.g. "if log is big, run logrotate", 'if lost IP, restart interface'

See also for a detailed overview

It is primarily set up to send alerts and show overall status, not so much for presenting pretty graphs.

There's a status export in XML (though not JSON in bare monit)

Monit is free open source software.

M/Monit is a paid-for wrapper that gives prettier graphs, and gives overview for many hosts. See also The pricing is good for datacenters and such, though not for home users.

See also