Data logging and graphing

From Helpful
Revision as of 23:50, 16 March 2021 by Helpful (Talk | contribs)


RRDtool notes


RRDtool mostly consists of

  • tools that store values into a round robin database
meaning a fixed number of samples is stored, all the space needed is allocated up-front, and it keeps a pointer to the most recent value and overwrites the oldest value.
since it's really just one large array, it is compact to store
the fixed size makes sense on space-limited devices, e.g. embedded
  • tools to graph that data

RRDtool is often used to keep track of various parameters of computer health and use (e.g. CPU use, free space, network load, number of TCP connections). There are packages that ease its use, e.g. munin in general and ganglia on clusters/grids.

RRDtool is geared to storing values for regular time intervals.
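
To make the round robin idea concrete, here is a minimal sketch in Python. It is not how RRDtool itself is implemented; the class and method names are made up for illustration. It only shows the fixed allocation, the pointer to the newest value, and the overwriting of the oldest one.

```python
class RoundRobinArchive:
    """Fixed-size store: all slots are allocated up front, the oldest value gets overwritten."""

    def __init__(self, rows):
        self.slots = [None] * rows   # None stands in for 'unknown'
        self.newest = -1             # index of the most recent value
        self.stored = 0              # how many slots hold real values so far

    def add(self, value):
        # advance the pointer, wrapping around so the oldest slot is overwritten
        self.newest = (self.newest + 1) % len(self.slots)
        self.slots[self.newest] = value
        self.stored = min(self.stored + 1, len(self.slots))

    def values(self):
        """Return the stored values, oldest first."""
        n = len(self.slots)
        start = (self.newest - self.stored + 1) % n
        return [self.slots[(start + i) % n] for i in range(self.stored)]


archive = RoundRobinArchive(3)
for sample in (1, 2, 3, 4, 5):
    archive.add(sample)
print(archive.values())   # only the three most recent samples are left
```

The file size never changes after creation, which is the property that makes this attractive on small devices.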

Gauges, rates, counts; on summarizing discrete events

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Use GAUGE for values that are instantaneous snapshots

Collection interval depends a bit on how transient possible peaks are, and how much you care to find them.

Slower is fine for things that change slowly (e.g. temperature of a room), or when you only care about larger trends and not unusual peaks (possibly CPU temperature).


temperature (fine, changes slowly)
disk space use (fine, changes slowly)
amount of open connections (could miss rapid peaks)

When you fetch data from an incremental counter, use COUNTER/DERIVE/ABSOLUTE


bytes moved by disk
bytes transferred by network
(often available as both total and a recent per-second rate)
blink-per-kWh counter
liter-of-water tick counter
webserver visits (depending on how you count/extract that count)

Slow counters are less intuitive, e.g. a liter-of-water tick covers all the time since the previous tick, which may be long ago. (Technically also true of kWh, but that's often spent on the scale of seconds or minutes, so pretending it was all recent is good enough for most uses)

Since the event counting is done for you already, you won't miss any peaks, though a long interval still smears them out - you're essentially storing the average rate.

The graphs will show these counts as rates.

Interval still matters in how much you want to see or locate peaks. E.g. with a 1-minute interval of transfer speed, you cannot tell the difference between [100MB/s for 1sec + 59 seconds of nothing] and [sustained ~1.7MB/s for 1 minute].
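
The arithmetic behind that, as a small Python sketch (the function name is made up; MB is taken as a decimal megabyte here): a counter only tells you the total moved over the interval, so a short burst and a slow steady transfer give the same stored rate.

```python
def average_rate(counter_before, counter_after, interval_seconds):
    """Per-second rate that a counter difference over one interval amounts to."""
    return (counter_after - counter_before) / interval_seconds


MB = 1000 * 1000

# one second at 100MB/s, then 59 seconds of nothing: the counter grew by 100MB
burst_total = 100 * MB * 1
print(average_rate(0, burst_total, 60) / MB)   # roughly 1.67 (MB/s)
```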


  • GAUGE
    • For values that can be graphed as they are: instantaneous values, not rates, and probably slow-changing.
    • Graphing does not at all involve the concept of rate.
  • COUNTER
    • to feed through counters that never decrease -- except at wraparound, at which time the value is calculated assuming typical wraparound (checking whether it's 32-bit or 64-bit integer overflow)
    • entered value is difference with last value(verify)
    • max (and sometimes min) is useful to avoid insertion of values you know are impossible
    • stored value: per-second rate
  • DERIVE
    • Basically COUNTER without the only-positive restriction and without the overflow checks
    • min and max are useful to avoid insertion of values you know are impossible
  • ABSOLUTE
    • like COUNTER, but with the implication that every time you hand a value to rrdtool, the counter you read from is reset.
    • Useful to avoid counter overflows, and the trouble of figuring out the right value at wraparounds.
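
A rough Python sketch of how the three counter-ish types turn incoming values into a per-second rate. This follows my reading of the rrdtool documentation rather than its actual code, the function names are made up, and `None` stands in for UNKNOWN.

```python
def counter_rate(previous, current, seconds):
    """COUNTER: a decrease is assumed to be a 32-bit wrap, or else a 64-bit wrap."""
    delta = current - previous
    if delta < 0:
        delta += 2**32                 # assume a 32-bit counter wrapped
        if delta < 0:
            delta += 2**64 - 2**32     # still negative: assume a 64-bit counter
    return delta / seconds


def derive_rate(previous, current, seconds, minimum=None):
    """DERIVE: plain difference, may be negative; below `minimum` it becomes unknown."""
    rate = (current - previous) / seconds
    if minimum is not None and rate < minimum:
        return None
    return rate


def absolute_rate(current, seconds):
    """ABSOLUTE: the source resets on every read, so the value is already the difference."""
    return current / seconds


print(counter_rate(2**32 - 100, 200, 300))       # wrap handled as a small increase
print(derive_rate(1000, 10, 300, minimum=0))     # a reset becomes None (unknown)
```

The second call is the reason DERIVE with min=0 is sometimes preferred over COUNTER: a source that resets to zero is recorded as unknown instead of being read as a huge wraparound jump.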


  • If you basically want COUNTER, but want resets of your source values to be stored as UNKNOWN rather than possibly mistaking them for counter wraps or large values, then you may prefer DERIVE with min=0
  • There is little point to sampling faster than your typical event interval.
And, depending on the case, you may as well go for an order of magnitude slower, particularly for graphing.
that is: if we sample faster than events typically happen, we may have only pretense of time resolution, which makes rate-per-interval graphs become less meaningful
Example: You have a waterflow meter that ticks once per liter. You sample it every five seconds. The last liter was probably used in the last minute, not the last five seconds, but that's where it is recorded, and that's where it will be graphed. (You can even start considering that taps, showers, and such are often limited to ~3 liters/minute)

Notes on RRD, DS, RRA

A RRD (round robin database) refers to a .rrd file.

A DS (data source) is the named thing that you send values into (using rrdtool update).

An RRD can contain more than one DS.

An RRA is basically a consolidation log with given settings - and one DS can have multiple RRAs. (why?(verify)) For example, you could do:

DS:cpuload:GAUGE:90:0:100 \
RRA:AVERAGE:0.5:1:1440 \
RRA:AVERAGE:0.5:7:1440 \
RRA:AVERAGE:0.5:30:1440 \

The --step value, combined with each RRA's amount-of-samples-to-consolidate value determines how many recent pre-consolidation samples are also kept in the .rrd file. (verify)

In simpler cases you can have one RRA per DS. Usually there are multiple DSes per RRD (different time series, e.g. disk space for each of your disks, the various types of CPU use side by side).

Creating RRDs - 'rrdtool create'

Creates a round robin archive in the named file(==round robin database).

Actually, a single .rrd can contain multiple data sources, and this can be handy if you have several directly related types of data or exact timing is important. I personally prefer a single data source per file, purely because this is slightly easier to manage.

A data source consists of:

  • an identifying name
  • a type (note: mostly has effect on graphing)
    • gauge value (e.g. for temperature; not time-divided when graphed): GAUGE
    • incremental (e.g. 'bytes moved by network interface', 'clock ticks spent idling'): COUNTER
    • rate (e.g. 'bytes moved the last minute'; assumed to be reset on reading): ABSOLUTE
  • the heartbeat, which is a maximum time between incoming values. If exceeded, the value will be stored as 'unknown'.
  • a minimum and maximum value, as protection against stray erroneous values; if exceeded, 'unknown' will be stored

The round robin archive the data source(s) get recorded in also has to be specified. The information it wants consists of:

  • the 'consolidation function' : Essentially a function applied to the bracket of collected data before it is consolidated. This can be used to store several types of statistics on one data source, for example the minimum, maximum, and average, useful eg. when measuring temperature. On simple counters, just 'average' is enough.
  • the 'x files factor' : If there are too many unknowns in a consolidation interval, the consolidated value should sometimes be unknown too (e.g. to avoid spikes because you're averaging over too few values). To make sure values make sense, you can require at least some fraction of the data points in a consolidation interval to not be undefined. The value you should use here actually depends on what you're recording.
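
A sketch of what consolidation with an x files factor amounts to, in Python. It assumes (as I understand the rrdcreate documentation) that the factor is the fraction of a bracket that may be unknown while the result still counts as known. The function is hypothetical and `None` stands in for unknown.

```python
def consolidate(samples, function="AVERAGE", xff=0.5):
    """Consolidate one bracket of data points into a single value, or None (unknown)."""
    known = [s for s in samples if s is not None]
    unknown_fraction = (len(samples) - len(known)) / len(samples)
    if not known or unknown_fraction > xff:
        return None                       # too little real data to trust the result
    if function == "AVERAGE":
        return sum(known) / len(known)
    if function == "MIN":
        return min(known)
    if function == "MAX":
        return max(known)
    raise ValueError("unsupported consolidation function: %s" % function)


print(consolidate([40, 42]))              # both known
print(consolidate([None, None, 40]))      # two thirds unknown, so None at xff=0.5
```

Storing MIN, MAX and AVERAGE side by side means one RRA per function on the same data source.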

Example: hard drive temperature

Say you take a value every fifteen seconds, want to store a value for every thirty seconds, and keep a week's worth of data.

To create the archive, you need to figure out

  • how many data points go into a consolidation - here 2 (two 15-second values into one consolidated 30-second value)
  • how many past consolidated values to store: 2880 half-minutes/day times 7 days/week, is 20160 data points

For example:

rrdtool create test.rrd --step 15 DS:test:GAUGE:25:U:U RRA:AVERAGE:0.5:2:20160

The most important numbers in there are the 15, the 2, and the 20160.

I personally like to think about consolidation from the other direction, considering:

  • how often will I update (15 sec)
  • how big is the interval that will be consolidated into (30 sec) - has to be a multiple of the update interval
  • how big a backlog do I want, say, 1 year
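
That way of thinking translates directly into the two numbers an RRA wants. A small Python sketch (the helper name is made up), using the week of backlog from the hard drive example:

```python
def rra_parameters(update_seconds, consolidated_seconds, backlog_seconds):
    """Return (steps, rows) for an RRA definition."""
    if consolidated_seconds % update_seconds != 0:
        raise ValueError("consolidated interval must be a multiple of the update interval")
    steps = consolidated_seconds // update_seconds     # samples per consolidated value
    rows = backlog_seconds // consolidated_seconds     # consolidated values to keep
    return steps, rows


WEEK = 7 * 24 * 3600
steps, rows = rra_parameters(15, 30, WEEK)
print("RRA:AVERAGE:0.5:%d:%d" % (steps, rows))   # RRA:AVERAGE:0.5:2:20160
```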

In more detail...

  • RRA:AVERAGE:0.5:2:20160
    • RRA - 'round robin archive'
    • AVERAGE - The aggregate function, to use in consolidation.
    • 0.5 - x files factor
    • 2 - steps; how many samples to consolidate
    • 20160 - amount of samples
  • DS:test:GAUGE:25:U:U
    • DS ('data source')
    • test - the name within the archive. You can create more than one DS in a .rrd archive.
    • GAUGE - type; see above. Temperature is a typical gauge value.
    • 25 - the heartbeat. I like to have this a bunch of seconds larger than the interval, to avoid UNKNOWNs purely because a script that does a lot of rrdtool updates takes a while. Not strictly the most accurate, but hey.
    • U:U - the minimum and maximum values - an attempt to add a value outside this range is considered invalid. Sometimes this is useful to filter out nonsense values that could be generated by a plugin. I tend not to use it.

Graphing - 'rrdtool graph'


rrdtool graph /var/stat/img/load.png \
 -w 200 --title "Load average" -E -l 0 -u 1 -s -86400 \
 DEF:o=/var/stat/data/load1.rrd:load1:AVERAGE \
 DEF:t=/var/stat/data/load15.rrd:load15:AVERAGE \
 CDEF:od=o,100,/ \
 CDEF:td=t,100,/ \
 AREA:td#99999977:"15-minute"

This shows loading data sources, basic calculating, and different draw types:

  • DEF loads a data source from a file, and assigns it a name.
  • CDEF is somewhat advanced, but necessary in this case: the values in this example were stored as integers, as 100*(the real load factor), and this is where you divide them back. This creates an extra graphable variable. (The expression syntax is reverse polish.)
  • AREA and LINE are two common ways to graph data. You usually specify what named variable to graph, the color, and the title. The color is hex, and can be RGB or RGBA. The example AREA is somewhat transparent.
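
If reverse polish is unfamiliar: operands come first, the operator last, so o,100,/ means o divided by 100. A toy evaluator in Python, handling only the four basic arithmetic operators (real CDEF expressions support many more, and this is not rrdtool's code):

```python
import operator

OPERATORS = {'+': operator.add, '-': operator.sub,
             '*': operator.mul, '/': operator.truediv}


def eval_rpn(expression, variables):
    """Evaluate a comma-separated reverse polish expression for one data point."""
    stack = []
    for token in expression.split(','):
        if token in OPERATORS:
            right = stack.pop()           # the operand pushed last is the right-hand side
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression: %r" % expression)
    return stack[0]


print(eval_rpn("o,100,/", {"o": 150}))    # 1.5
```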

Options used here:

  • --title "foo" sets the title of the graph
  • -s -86400 sets how many seconds back to graph from (-86400 is a day)
  • -l 0 and -u 1 set (non-rigid) lower and upper bounds on the y scale. Zero and one will be useful for e.g. load average: the axis will scale up to larger values, and will not sometimes show e.g. 200m, 400m, 600m (as in milli) when the largest value is rather small. This can also be useful to e.g. set a memory graph's upper bound to how much RAM you have. Note that to force this range rather than to make it only work when the data lies within that range and not beyond, you also need -r (--rigid). There are other autoscaling alternatives that may fit other uses better.
  • -c: specify overall colors, such as
    -c BACK#ffffff -c SHADEA#eeeeee
    (see docs for those names)
  • -w and -h make the graph area (canvas) a specific pixel width and height; the resulting picture is somewhat larger because of labels and legend.
  • -E: gives prettier lines instead of the truer-to-data blocky ones.
  • -b 1024 to make 'kilo' and 'mega' and such be binary-based, not decimal-based, which is useful for accurately showing drive space and such.



Munin notes

Munin wraps a bunch of convenience around rrdtool, and makes it easier to aggregate from many hosts.

A munin setup has one master, and one or more nodes.

Nodes report data when asked by a master.
A master asks the nodes for data and creates graphs and such.

In a single-node setup, a computer reports on itself, so that host is node and master. You will probably not care about the fancy networked options in this case.

(A node could also be reporting to multiple masters. This could in theory be useful if you want to generate a report on a workstation, and have a centralized overview of workstations.)

On using plugins

There is a "plugins to actively use" directory (usually /etc/munin/plugins).

Its entries are usually symlinks into a larger "plugins you have installed and could use" directory (location varies, e.g. /usr/share/munin/plugins/ or /usr/libexec/munin/plugins/).

Plugins are picked up after you restart munin-node. (sometimes earlier?(verify))

Manual setup

If I wanted to use the ntp_offset plugin, I might do:

cd /etc/munin/plugins
ln -s /usr/libexec/munin/plugins/ntp_offset  ntp_offset

When a plugin name ends with an underscore, the text that follows it in the symlink name is an argument. This is often done to ease separate instances of the same plugin script, for example:

cd /etc/munin/plugins
ln -s /usr/libexec/munin/plugins/ip_     ip_eth0
ln -s /usr/libexec/munin/plugins/ip_     ip_eth1
ln -s /usr/libexec/munin/plugins/ping_   ping_192.168.0.1
ln -s /usr/libexec/munin/plugins/ntp_    ntp_192.168.0.1

Note that other configuration (e.g. database logins) still goes into munin configuration files.
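
On the plugin side, this works because the plugin looks at the name it was invoked under. A minimal Python sketch of that, with a made-up helper name:

```python
import os
import sys


def wildcard_argument(invoked_as, prefix):
    """Return the part of the symlink name after the prefix, or None if there is none."""
    name = os.path.basename(invoked_as)
    if not name.startswith(prefix):
        return None
    return name[len(prefix):] or None


# inside a plugin you would typically pass sys.argv[0]
print(wildcard_argument('/etc/munin/plugins/ip_eth0', 'ip_'))   # eth0
```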

Semi-automatic setup

You can ask munin to suggest plugins that will probably work. You can get a summary with:

munin-node-configure --suggest | less

I prefer to get this in a more succinct form -- namely the commands that activate the plugin:

munin-node-configure --suggest --shell | less

If you want your own plugin to report itself here, look at the autoconf and suggest parameters. (and magic markers)

Testing and troubleshooting

If running a plugin fails but you don't know why, look at the logs:

tail -F /var/log/munin/*.log

...and see what errors scroll past

you can wait for the next system run
or invoke it yourself, something like
sudo -u munin munin-cron
(probably varies somewhat between systems, and this messes with the sampling somewhat so see below for a subtler variant)
For example, my apache_volume was broken because I'd set a wrong host in the configuration, and the log showed the timeouts.

If there's no error in logs, you may want to edit the plugin script.

For example, the hddtemp script does a 2>/dev/null, meaning silent failure on e.g. permission errors.

Sometimes it helps to run a specific plugin.

While you can run it directly, it's more representative if munin runs it, particularly if the problem is permissions or environment, but also because some plugins receive some config from munin itself.

It seems the best check is to ask the munin daemon directly (config and/or fetch), e.g.

 echo "config mysql_queries" | netcat localhost 4949
 echo "fetch mysql_queries" | netcat localhost 4949

You should see the plugin's config or value output, e.g.:

delete.value 16827
insert.value 15415
replace.value 1201
select.value 1594114
update.value 144234
cache_hits.value 1017162
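
The same check can be scripted. A Python sketch, assuming the node listens on localhost:4949 as above, greets with a banner line, and ends a fetch response with a line containing only a dot; the function names are made up. Values are kept as strings, since a plugin may report something non-numeric for unknown.

```python
import socket


def parse_values(lines):
    """Turn 'name.value 123' lines into a dict, stopping at the '.' terminator."""
    values = {}
    for line in lines:
        line = line.strip()
        if line == '.':
            break
        if not line or line.startswith('#'):
            continue                      # skip blank lines and comments
        key, _, value = line.partition(' ')
        if key.endswith('.value'):
            values[key[:-len('.value')]] = value
    return values


def fetch(plugin, host='localhost', port=4949, timeout=10):
    """Ask a munin node for a plugin's current values."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        stream = conn.makefile('rw', encoding='utf-8', newline='\n')
        stream.readline()                 # banner line sent by the node
        stream.write('fetch %s\n' % plugin)
        stream.flush()
        lines = []
        for line in stream:
            lines.append(line)
            if line.strip() == '.':
                break
        return parse_values(lines)
```

Usage would be something like fetch('mysql_queries'), which only works where a node is actually running and allows your connection.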

A decent second choice is to make munin itself run the plugin

sudo -u munin  munin-run mysql_queries

If there is any difference between using munin-run and netcat, that points to environment stuff, e.g. PATH or suid or whatnot. Look at the munin's central plugin config file(s).

logs say "sudo: no tty present and no askpass program specified"

See Linux_admin_notes_-_users_and_permissions#no_tty_present_and_no_askpass_program_specified for the general reason, and generally the better fixes.

One way around the issue is to not use sudo, but configure munin to run the plugin as root.

In /etc/munin/plugin-conf.d/munin-node add something like:

[<plugin name>]
user <user>
group <group>

Not seeing a graph

Usually because there is not a single datapoint stored yet. (logs will say something like rrdtool graph did not generate the image (make sure there are data to graph))

If it doesn't show up after a couple of minutes, the plugin is probably failing - look at your munin-update.log

Can't exec, permission denied


  • directory not world accessible (by munin)
  • lost execute bit
  • TODO

On intervals

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

It seems like:

  • the storage granularity is 5 minutes
  • the cronjob updating values typically runs every 5 minutes because of it
Recent munin (master since 2.0.0) has a per-plugin(-only) setting
(in seconds) for the interval in which it expects values
note that changing it effectively asks to (delete and) recreate the RRD files.(verify)

Sub-minute is possible but probably means you need something other than cron for the updating.

On writing plugins

A plugin is any executable that reacts at least to being run in the following two ways:

  • without argument - the response should be values for the keys that the configuration mentions to exist, like:
total.value 4
other.value 8
  • With one argument, config, in which case it should mention how to store and how to graph the series. A simple example:
graph_title Counting things
graph_vlabel Amount
total.type  GAUGE
total.label Total
other.type   GAUGE
other.label  Other   Things that do not fall into other categories
other.draw   AREA
other.colour 336655

The names of series (e.g. the other and total above) must conform to, basically, the regexp [A-Za-z][A-Za-z0-9_]*.
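
When series names come from data (interface names, usernames, paths), it helps to check or clean them first. A Python sketch using the regexp as given above; the helper names and the 'f_' prefix are arbitrary choices of mine, not munin conventions.

```python
import re

FIELD_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_]*')


def is_valid_field_name(name):
    """True if the whole name matches the pattern for series names."""
    return FIELD_NAME.fullmatch(name) is not None


def clean_field_name(name):
    """Replace disallowed characters with '_' and make sure it starts with a letter."""
    cleaned = re.sub(r'[^A-Za-z0-9_]', '_', name)
    if not re.match(r'[A-Za-z]', cleaned):
        cleaned = 'f_' + cleaned          # arbitrary prefix, only there to start with a letter
    return cleaned


print(clean_field_name('eth0.rx'))        # eth0_rx
```

Note that cleaning like this can map two different inputs to the same name; hex-encoding the original string avoids that at the cost of readability.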

Example plugin I wrote some time:

""" reports the amount of logins per user, based on 'w' command, 
    Always reports all seemingly-real users, so that history won't fall away when they are not logged in.
    (written for Python 3)
"""
import sys
import os
import subprocess

def user_count():
    " returns a sorted list of (username, login count) "
    counts = {}
    # figure out users to always mention (as 0)
    with open('/etc/passwd') as passwd:
        for line in passwd:
            if line.count(':')!=6:
                continue # skip malformed lines
            user,_,uid,gid,name,homedir,shell = line.strip().split(':')
            if user in ('nobody',): # often useful to leave in here
                counts[user] = 0
                continue
            if 'false' in shell or 'nologin' in shell:
                continue # skip, not a user who can log in
            if not os.access(homedir,os.F_OK):
                continue # skip, homedir does not exist
            if '/var' in homedir or '/bin' in homedir or '/usr' in homedir or '/dev' in homedir:
                continue # skip, mostly daemons
            counts[user] = 0
    # count current logins: w prints one line per login session
    proc = subprocess.Popen('w --no-header -sui', stdout=subprocess.PIPE, shell=True, universal_newlines=True)
    out,_ = proc.communicate()
    for line in out.splitlines():
        if not line.strip():
            continue # skip empty lines
        username = line.split()[0]
        if username not in counts:
            counts[username]  = 1
        else:
            counts[username] += 1
    ret = []
    for username in sorted(counts):
        ret.append( (username, counts[username]) )
    return ret

def safename(user):
    " hex-encode the username, so that the series name only uses allowed characters "
    return user.encode('utf8').hex()

if len(sys.argv) == 2 and sys.argv[1] == "autoconf":
    print("yes")
elif len(sys.argv) == 2 and sys.argv[1] == "config":
    print('graph_title Active logins per user')
    print('graph_vlabel count')
    print('graph_category system')
    print('graph_args -l 0')
    for user, count in user_count():
        print('_%s.label %s'%(safename(user),user))
        print('_%s.draw LINE1'%(safename(user))) # could do AREA/STACK instead, but this works...
        print('_%s.type GAUGE'%(safename(user)))
else: # not config: current values
    for user, count in user_count():
        print('_%s.value %s'%(safename(user),count))

Some of the things you can use:

  • graph_title
  • graph_args
    - (user-determined) arguments to pass to rrdtool graph - say, --base 1024 -l 0
  • graph_vlabel
    - y-axis label
  • graph_scale
    - (yes or no) Basically, whether to use m, k, M, G prefixes, or just the numbers as-is
  • graph_info
    - basic description
  • graph_category
    - If you can fit into an existing category, use that. Otherwise use other (which is the default if you don't mention it). You can't use arbitrary strings.
  • seriesname.type
    - one of the rrdtool types (GAUGE, ABSOLUTE, COUNTER, DERIVE)
  • seriesname.label
    - label used in graph legend. Seems to be more or less required.
  • seriesname.draw
    - how to draw the series (think LINE, AREA, STACK). Defaults to LINE2(verify)
    - description of series (shown in reports)
  • seriesname.colour
    - what colour to draw it with. By default this is automatic
  • seriesname.min
    - minimum allowed value (insert-time)
  • seriesname.max
    - maximum allowed value (insert-time)
  • seriesname.warning
    - value above which a value should generate a warning
  • seriesname.critical
    - value above which a value is considered critical
  • seriesname.cdef
    - lets you add rrdtool cdefs
  • seriesname.negative
    - (cdef-based?) quick hack to easily mirror a series on the
  • seriesname.graph
    - whether to graph it. Can be useful in multigraph.


  • if you want something to show up in the legend but not the graph, seriesname.draw LINE0 may be simplest
  • it looks like values for unknown variables are ignored (verify)
  • for multiple graphs from one plugin, basically just mention
multigraph name
as a separator between complete configs - and between the generated value sets.

  • When debugging drawing details and you don't want to wait five minutes for the graph to update, run something like: (full path because it's probably not in your PATH. It may be somewhere else.)
sudo -u munin /usr/share/munin/munin-graph

  • The introduction mentions that config is read when the plugin is first started.
It also does so when you tell the service to reload(verify), and it looks like munin-update may do so at every run(verify)
  • You can change the config later, adding series (munin will automatically create storage for any new series it sees) and removing series (backing storage will stick around unused), though you probably want to be consistent about the names.
    • Note: If series appear and disappear in your config, their history will appear or not depending on whether they were mentioned the last time the images were generated -- which can be confusing.
For example, I wrote a 'show me the largest directories under a path' plugin, but that meant that any directory that moved would be removed from all graphed history (which also shows more free past space than there really was).
For other cases you can fix this easily enough, e.g. 'login count per user' can look at /etc/passwd and list everyone who seemingly can log in.
  • there are other possible arguments to the plugin executable, such as autoconf and suggest for the "suggest what plugins to install" feature. So don't assume a single argument is config
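
To illustrate the multigraph separator mentioned in the notes above, a Python sketch of a plugin's output for two graphs. The graph and series names are invented, and a real multigraph plugin may need more than this (check the munin documentation); it only shows the same separator line being used in both the config and the value output.

```python
def multigraph_output(graphs, mode):
    """graphs maps graph name -> (config lines, {series: value}); mode is 'config' or 'fetch'."""
    lines = []
    for name in sorted(graphs):
        config_lines, values = graphs[name]
        lines.append('multigraph %s' % name)      # separator before each graph's block
        if mode == 'config':
            lines.extend(config_lines)
        else:
            lines.extend('%s.value %s' % (series, values[series])
                         for series in sorted(values))
    return '\n'.join(lines)


example = {
    'things_count': (['graph_title Counting things', 'total.label Total'], {'total': 4}),
    'things_size':  (['graph_title Size of things', 'bytes.label Bytes'], {'bytes': 2048}),
}
print(multigraph_output(example, 'config'))
print(multigraph_output(example, 'fetch'))
```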

On keeping some state

On config and environment

Munin's plugin-conf.d/ (usually at /etc/munin/plugin-conf.d/) fixes up any necessary environment for plugins - often primarily centralized configuration.

(note: files are read alphabetically. In case that matters, which it generally doesn't)

It's not unusual to see a lot of stuff in a single file called munin, and a few not-so-core plugins having their own files.

For syntax, see [1] The most interesting are probably:

  • user (and group)
    run the plugin itself as this user/group
easy for an admin to change this structurally
  • env.something
    sets that variable in the process's environment
For example default warning/error levels that apply to all instances
For example database password - good for multiple instances, and can be useful permission-wise
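
On the plugin side, anything set via env.something simply shows up as an environment variable. A Python sketch; the variable names dbhost, dbpassword and warning are hypothetical examples, as are the fallback values.

```python
import os


def plugin_settings(environ=None):
    """Collect this plugin's settings from the environment, with fallbacks."""
    if environ is None:
        environ = os.environ
    return {
        'host':     environ.get('dbhost', 'localhost'),    # e.g. from: env.dbhost db1
        'password': environ.get('dbpassword'),             # None when not configured
        'warning':  float(environ.get('warning', 80)),     # default level, overridable
    }


print(plugin_settings({'dbhost': 'db1', 'warning': '90'}))
```

Having fallbacks in the plugin means it still runs (for example via a direct test run) when the central configuration does not mention it.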

On the report

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
To change the categorisation of the graphs, you're probably looking for

A plugin often doesn't explicitly have it; you can add one. It used to be a fixed set, but since a few years ago you can invent your own categories (verify)

The CGI zoom graph will require you to hook in the munin-cgi-graph CGI script.


  • that script is in /usr/lib/munin/cgi/
  • you put munin in its own vhost (relevant to where to mount it, path-wise, below it's on /)
  • you have either the FastCGI or more basic CGI module

Then you'll need to add something like

ScriptAlias   /munin-cgi/ /usr/lib/munin/cgi/
<Directory "/usr/lib/munin/cgi">
  Options +ExecCGI
  <IfModule mod_fcgid.c>
    SetHandler fcgid-script
  </IfModule>
  <IfModule !mod_fcgid.c>
    SetHandler cgi-script
  </IfModule>
</Directory>
<Directory /usr/lib/munin/cgi>
  Satisfy Any
</Directory>

On the CGI (manual-input) graph

If it doesn't work, you've probably done the same thing I did: just placed the html output somewhere it is served.

The CGI graph (munin-cgi-graph), well, needs CGI. Look at:

The URL path seems to default to /munin-cgi/somethingorother

Ganglia notes

Made to monitor the same things on many compute nodes in a cluster. Built on RRDtool.

Nagios notes

Grafana notes

Graphite notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Graphite can be seen as an easier and more scalable alternative to rrdtool. (in that it consists largely of round-robin time-series and mostly line graphs)

Written in python. Consists of:

  • data collection
can feed in over network with simple syntax

  • carbon, which aggregates data
    • data collection scripts send their data here
    • carbon is responsible for storing it in whisper
    • will cache data not yet written (fetchable by the webapp, to help the real-time aspect)
  • whisper, a round-robin database
    • Much like a rrdtool database, but slightly more flexible, and its IO scales a little better when you want to record a lot of different things (even though it uses a little more CPU time)
  • webapp
    • Served via Django (which the creator runs under mod_python, but you may like your own way)
    • fetches from whisper, and carbon-cache
    • calls graphing via Cairo (pycairo)
    • user interface with ExtJS
  • can use memcached
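
The "simple syntax" for feeding data is, as far as I know, carbon's plaintext protocol: one line per value, consisting of metric path, value, and unix timestamp. A Python sketch, assuming carbon listens for this on localhost port 2003 (its usual default); the metric name and function names are made up.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """One plaintext line: '<metric path> <value> <unix timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return '%s %s %d\n' % (path, value, timestamp)


def send_metric(path, value, host='localhost', port=2003, timestamp=None):
    """Send a single value to carbon over TCP."""
    message = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(message.encode('ascii'))


print(format_metric('servers.web1.load', 0.42, 1615938600), end='')
```

The dots in the metric path become the hierarchy you browse in the webapp.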


Cube notes

Datadog notes