Influxdb notes

From Helpful

See also Varied_databases#InfluxDB for context


Data model notes
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In comparison to a more relational view...

  • database is a logical container for
users
retention policies
time series data
continuous queries


  • retention policies (RPs) contain:
replication factor (copies kept in the cluster) (autogen's default is 1)
retention - how long to keep the data (min 1h) (autogen's default is infinite)
shard group duration - the time span each shard group covers (min 1h) (autogen's default is 7d)
measurements - each measurement is implicitly part of the retention policy you put it in
each database can have one or more RPs
you get a default called autogen (defaults mentioned above)
you'll quickly notice them in addressing (testdb.autogen.measurementname) though can ignore everything about them at first


  • measurements are like tables, containing tags, fields and points
  • series
basically refers to a (measurement,tag) combination you'd likely use in querying -- see below
  • tags are key-value pairs (column-like) you can add per datapoint that are
part of a series' uniqueness, and are indexed
basically, whatever you need for lookup
limited to 64K (and you probably don't want to get close without good reason)
  • fields are key-value pairs (column-like) you can add per datapoint that are
not part of its uniqueness, not indexed
basically, any values you may be looking up
types include float (64-bit), integer (64-bit), boolean, or string (timestamp is also a data type, but it is used for the point's time rather than for fields)
a field takes the type of the first value it gets (and currently cannot be changed except with some creativity[1]), so e.g. forcing integer (append i to the value, like 88i) over float is sometimes necessary, e.g. to store large values without losing precision
keep in mind that a field will not take different types over time, even if it might be fine, so being consistent per measurement is a good idea. You'll see errors in the log like field type conflict: input field "ping" on measurement "net" is type integer, already exists as type float
strings possibly not limited to 64K? (I've seen conflicting information)
but you probably don't want to use influxdb as a blob store if you want it to stay efficient
you can check current type with show field keys
  • (data) points
always have a time(verify)
there is always an index on time(verify)
time precision is a specific detail you can control



Typical use of measurements, series, tags

Say you want to start keeping track of CPU use and are collecting it for various datacenters (various tutorials use an example like this).

You might have a

  • specific database for this purpose (for admin reasons)
  • retention policy mostly because you want monitoring stuff deleted after a year without thinking about it
  • measurement called host_monitor

and want to enter a datapoint with

  • tags like hostname=node4,datacenter=berlin,country=de
  • fields like cpu0=88,cpu2=32


You'll notice this is a pile of everything CPU-related.

Tags are usually structured with common uses in mind, often the coarsest and finest things you anticipate querying on - you can e.g. pick out/filter and so summarize per country, or pick out a particular host if needed (and you still need a combination of tags - hostnames are likely only unique within datacenters).


Series are basically the idea that each unique combination of (measurement,all_tags) represents a series.

There is no on-disk data structure of a series per se. Data you send in from different places will often imply unique series through having unique tags, but series are mostly a querying concept, and a storage concept only insofar as the indexing helps querying.(verify)
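
As a rough illustration of the series idea (a model of the concept, not of how InfluxDB stores anything), this Python sketch derives a key from each point's measurement and tag set and counts the distinct ones. The series_key helper and the sample points are made up for this example.

```python
def series_key(measurement, tags):
    """Return a hashable key for the (measurement, tag set) combination."""
    # Sort the tags so that the order they were given in does not matter.
    return (measurement, tuple(sorted(tags.items())))


# Hypothetical points: (measurement, tags, fields)
points = [
    ("host_monitor", {"hostname": "node4", "datacenter": "berlin", "country": "de"}, {"cpu0": 88}),
    ("host_monitor", {"country": "de", "datacenter": "berlin", "hostname": "node4"}, {"cpu0": 90}),
    ("host_monitor", {"hostname": "node4", "datacenter": "munich", "country": "de"}, {"cpu0": 12}),
]

# Fields are ignored on purpose: only measurement and tags identify a series.
series = {series_key(m, tags) for m, tags, _fields in points}
print(len(series))  # 2
```

The first two points land in the same series because only their field values differ. The third has the same hostname but a different datacenter, so it is a separate series.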


On point uniqueness

A point is unique if it has a distinct (measurementname,tagset,timestamp) combination, so if you write a point when one with that tuple already exists, field values are merged/overwritten.

Depending on the timestamp precision you hand into the ingest url, such overwriting...

  • may never happen, if it's based on 'now'
  • may happen but never be an issue, if the precision is much higher than the interval you send in
  • may be something you do intentionally
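
The merge/overwrite behaviour can be modelled with a plain dict keyed on that tuple. This is only a sketch of the semantics described above, assuming that new fields are added and existing fields are replaced. The write_point function and the store dict are hypothetical names, not part of any client library.

```python
def write_point(store, measurement, tags, fields, timestamp):
    """Insert a point, merging fields if the same point already exists."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    # setdefault creates the point on first write; update() then adds new
    # fields and overwrites fields that were already present.
    store.setdefault(key, {}).update(fields)


store = {}
tags = {"hostname": "node4"}
write_point(store, "host_monitor", tags, {"cpu0": 88}, 1600000000)
write_point(store, "host_monitor", tags, {"cpu0": 90, "cpu2": 32}, 1600000000)
write_point(store, "host_monitor", tags, {"cpu0": 70}, 1600000001)

print(len(store))  # 2
```

The first two writes share measurement, tags and timestamp, so they end up as one point with cpu0=90 and cpu2=32. The third has a different timestamp and stays separate.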


On timestamp precision

Timestamps are currently nanosecond resolution by default.

This can be reduced to microsecond, millisecond or second.


Lower-precision timestamps lead to

  • a little more on-disk compression[2]
  • overwriting of data that ends up with the same timestamp (see previous point)
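
To see why coarser precision makes collisions more likely, here is a small sketch that expresses nanosecond timestamps in coarser units. The unit names (ns, u, ms, s) are what I believe the precision parameter of /write accepts; the truncate helper is made up, and plain integer division is an assumption about how the remainder is handled.

```python
NS_PER_UNIT = {"ns": 1, "u": 1_000, "ms": 1_000_000, "s": 1_000_000_000}


def truncate(timestamp_ns, precision):
    """Express a nanosecond Unix timestamp in a coarser unit, dropping the remainder."""
    return timestamp_ns // NS_PER_UNIT[precision]


a = 1_600_000_000_250_000_000  # some instant
b = 1_600_000_000_750_000_000  # half a second later

print(truncate(a, "ms") == truncate(b, "ms"))  # False
print(truncate(a, "s") == truncate(b, "s"))    # True
```

At millisecond precision the two instants stay distinct; at second precision they become the same timestamp, so with identical measurement and tags the later write would overwrite the earlier one.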

FIGUREOUT:

  • Is that per database, series, datapoint at insertion time?
  • Does it mix precision if you alter precision over time?



API

https://docs.influxdata.com/influxdb/v1.8/tools/api/#influxdb-1-x-http-endpoints


/query queries, management
/write ingest, takes line protocol
/ping health and version
/debug/pprof Generate profiles for troubleshooting
/debug/requests Track HTTP client requests to the /write and /query endpoints
/debug/vars Collect internal InfluxDB statistics
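
A minimal sketch of talking to /write and /query with only the Python standard library. It builds the requests without sending them, so it can be read (and run) without a server. The base URL, database name and helper names are assumptions for illustration; db, precision and q are the query-string parameters as I understand the 1.x API.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8086"  # assumption: local instance on the default port


def make_write_request(db, lines, precision="ns"):
    """Build (but do not send) a POST to /write carrying line protocol text."""
    params = urllib.parse.urlencode({"db": db, "precision": precision})
    return urllib.request.Request(
        f"{BASE_URL}/write?{params}",
        data=lines.encode("utf-8"),
        method="POST",
    )


def make_query_request(db, influxql):
    """Build (but do not send) a GET to /query for a single InfluxQL statement."""
    params = urllib.parse.urlencode({"db": db, "q": influxql})
    return urllib.request.Request(f"{BASE_URL}/query?{params}", method="GET")


req = make_write_request("testdb", "host_monitor,hostname=node4 cpu0=88", precision="s")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Note that a GET puts the query text in the URL, which is what can end up in access logs.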


The line protocol[3] is a one-liner text presentation that looks like

measurement,tag_set field_set timestamp

where

tag_set and field_set are comma-separated key=val pairs
timestamp is nanosecond-precision Unix time
(also optional; defaults to the server's local time, in UTC)
whichever side sets the timestamp, be aware of
clock drift (so you likely want NTP)
timezones (so have a serious think about using either client time or server time)

Clients may ease conversion of structured data to line protocol.
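
If you don't want a client library, a hand-rolled formatter is not much code. The sketch below is my reading of the line protocol rules (commas and spaces escaped in measurement names; commas, equals signs and spaces escaped in keys and tag values; string field values double-quoted; a trailing i for integers), so treat it as a starting point rather than a complete implementation. The function names are made up, and NaN/infinity floats are not handled.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field_value(value):
    # bool must be tested before int, because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # trailing i marks an integer
    if isinstance(value, float):
        return repr(value)
    # strings are double-quoted, with backslash and quote escaped
    return '"' + _escape(str(value), '\\"') + '"'


def to_line(measurement, tags, fields, timestamp=None):
    """Format one point as a line of line protocol."""
    head = _escape(measurement, ", ")
    for key, val in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(val, ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field_value(val)
        for key, val in fields.items()
    )
    line = head + " " + body
    if timestamp is not None:
        line += f" {timestamp}"
    return line


print(to_line("host_monitor",
              {"hostname": "node4", "datacenter": "berlin"},
              {"cpu0": 88, "load": 0.5},
              1600000000000000000))
# host_monitor,datacenter=berlin,hostname=node4 cpu0=88i,load=0.5 1600000000000000000
```

Sorting the tags is optional; it just makes the output stable.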


On shards and shard groups

Under the hood, data is stored in shards, shards are grouped in shard groups, shard groups are part of retention policies.

This is under-the-hood stuff you don't really need to know, though it may be useful to consider in that shard groups are related to

  • the granularity with which old data is removed because of retention policy (it's dropped in units of shard groups - so never immediately)
  • efficiency of the typical query, e.g. when most queries deal with just the most recent shard group, and the rest is essentially archived and rarely or never touched by IO
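
To make the "never immediately" point concrete, here is a small model of when a point can actually disappear. It assumes shard group windows are aligned to multiples of the shard group duration since the Unix epoch, and that a shard group is only dropped once its end time is older than the retention period; both are my assumptions rather than something checked against the source, and the function names are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shard_group_window(t, duration):
    """Return (start, end) of the shard group window containing t.

    Assumption: windows are aligned to multiples of duration since the epoch.
    """
    index = (t - EPOCH) // duration
    start = EPOCH + index * duration
    return start, start + duration


def earliest_drop_time(t, duration, retention):
    """A point can only go once its whole shard group is past retention."""
    _start, end = shard_group_window(t, duration)
    return end + retention


point_time = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
drop = earliest_drop_time(point_time, timedelta(days=1), timedelta(days=30))
print(drop - point_time)  # 30 days, 12:00:00
```

So with a 30 day retention and 1 day shard groups, a point written at noon sticks around for at least 30.5 days in this model; in general somewhere between the retention period and retention plus one shard group duration.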


Config notes

Getting influx to log is great for debugging, but is both very verbose and (unless all its clients POST instead of GET) puts possibly-sensitive information in system logs, in which case you probably want to set log-enabled = false in [http] in influxdb.conf


Note that if you put an HTTP server/proxy in front, the same logging concern may apply there as well.

If apache, consider something like

 SetEnvIf Request_URI "/submit" dontlog 
 CustomLog /var/log/apache2/access.log combined env=!dontlog


Security notes

Host security

Because it's designed to be clustered, it serves on all interfaces by default (and names should be resolvable).

On a single-node installation you could choose to serve on localhost only, via bind-address in the [http] section, which you'd want as

bind-address = "localhost:8086"    # default is ":8086"

(There used to be an admin web interface on port 8083, but this has been removed[4]. You now probably want to use Chronograf.)


Auth

For similar reasons, by default there is no authentication[5]

you may wish to firewall things at IP level
if you want auth, you need to enable security and create users (see below)
auth happens at HTTP request scope, e.g. for the API and CLI
certain service endpoints are not authenticated
  • can do HTTPS itself [6]
you can get server certs and client certs - and use self-signed ones if you wish
note that in a microservice style setup, you may wish to do this on the edge / ingest sides instead


User authentication

Enable: set auth-enabled=true in the [http] section and restart

You can

  • use basic auth
  • hand in username and password in the URL or body
  • JSON Web Tokens

If nonlocal, it's recommended you use HTTPS, because all of these options are effectively plaintext.
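
For example, basic auth is just a base64-encoded header, which is why it counts as plaintext without HTTPS. A minimal sketch using the standard library, building the request without sending it; the host, database, credentials and the make_authed_query name are made up.

```python
import base64
import urllib.parse
import urllib.request


def make_authed_query(base_url, db, influxql, username, password):
    """Build (but do not send) a /query request carrying basic auth."""
    params = urllib.parse.urlencode({"db": db, "q": influxql})
    req = urllib.request.Request(f"{base_url}/query?{params}")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    return req


req = make_authed_query("https://localhost:8086", "testdb",
                        "SHOW MEASUREMENTS", "reader", "secret")
header = req.get_header("Authorization")

# Anyone who can see the header can trivially recover the credentials:
print(base64.b64decode(header.split(" ", 1)[1]))  # b'reader:secret'
```

Putting the credentials in the URL instead (u and p parameters, as far as I know) has the same problem, plus they may then show up in access logs.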


User authorisation

New non-admin users have no rights. They can be given

READ,
WRITE, or
ALL (meaning READ and WRITE)

per database


Admin users have a lot more rights, like

  • CREATE DATABASE, and DROP DATABASE
  • DROP SERIES and DROP MEASUREMENT
  • CREATE RETENTION POLICY, ALTER RETENTION POLICY, and DROP RETENTION POLICY
  • CREATE CONTINUOUS QUERY and DROP CONTINUOUS QUERY
  • user management


Querying notes

Query languages

InfluxQL - an SQL-like language [7]

Flux - a more featured language [8]


InfluxQL examples (you may want to run the cli, e.g. influx -precision rfc3339 where that argument is for human-readable time formatting):

USE testdb

A simple query like

SELECT "eth0_rx", "eth0_tx" FROM "pc_monitor"

returns all data we have for those fields in that measurement.

Queries like

SELECT "cpu_used" FROM "pc_monitor" WHERE time > now() - 15m

limit the time range, but when you want a timeseries you often want to regularize it like:

SELECT mean("cpu_used") FROM "pc_monitor" WHERE time > now() - 15m GROUP BY time(1m) fill(null)

This adds a time interval (1m), what to do with multiple values in an interval (aggregate into the mean), and what to do with empty intervals (fill).


On fill(): GROUP BY time() creates regular intervals(verify), so it has to do something for intervals with no data. Options:

null: return timestamp with null value (default)
none: omit entry for time range
previous: copy value from previous time interval
linear: linear interpolation

[9]
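
To get a feel for what the fill options do, here is a toy model of GROUP BY time() with a mean aggregate, in Python. It is a simplification under stated assumptions: buckets are aligned to the start argument (the real thing uses its own preset boundaries), only null, none and previous are modelled (linear is left out), and group_by_time is a made-up name.

```python
def group_by_time(points, start, end, interval, fill="null"):
    """Mean per interval over [start, end); points are (timestamp, value) pairs.

    Returns a list of (interval_start, mean_or_None).
    """
    buckets = {t: [] for t in range(start, end, interval)}
    for ts, value in points:
        if start <= ts < end:
            buckets[start + (ts - start) // interval * interval].append(value)

    rows, previous = [], None
    for t, values in buckets.items():
        if values:
            previous = sum(values) / len(values)
            rows.append((t, previous))
        elif fill == "null":
            rows.append((t, None))      # keep the interval, with no value
        elif fill == "previous":
            rows.append((t, previous))  # repeat the last interval that had data
        elif fill == "none":
            continue                    # leave the interval out entirely
        else:
            raise ValueError(f"unsupported fill mode: {fill}")
    return rows


points = [(0, 10.0), (30, 20.0), (130, 40.0)]  # (seconds, value)
print(group_by_time(points, 0, 180, 60, fill="null"))
# [(0, 15.0), (60, None), (120, 40.0)]
```

With fill="none" the middle row disappears, and with fill="previous" it becomes (60, 15.0).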


Getting the most recent value

Consider:

SELECT last("cpu_used") FROM "pc_monitor" WHERE time > now() - 1h

Notes:

last() aggregate is what it sounds like
you want a time limit, to avoid selecting the entire time series for that
you probably want that anyway, when you care to view something current


For a gauge, you may want a recent average, like:

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) ORDER BY time desc

Notes:

  • because this compares against now, the most recent interval that GROUP BY time() creates may not have a value in it yet, meaning you'll get a null for that interval (unless you add something like fill(none))


SELECT LAST(eth0_tx) from pc_monitor


SELECT LAST(field_name), * from test_result GROUP BY *


GROUP BY * effectively separates by series

SELECT LAST(*) from pc_monitor group by *

Practically similar to

SELECT * FROM "pc_monitor" WHERE time > now() - 15s ORDER BY time desc limit 1

Though you may like some averaging, like

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) fill(none) ORDER BY time desc limit 1

Note that without the ORDER BY time desc limit 1 you'd probably get two time periods (at least until the group time is at least twice the selection time)
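
When you run such a query over HTTP rather than in the CLI, you get JSON back and have to pick the value out yourself. A sketch of doing that in Python, assuming the 1.x response shape of results, then series, then columns and values (with a tags dict per series when you GROUP BY tags). The sample body is made up, as is the rows_from_response name.

```python
import json


def rows_from_response(body):
    """Yield one dict per returned row, with column names as keys."""
    for result in json.loads(body).get("results", []):
        # A statement that matched nothing has no "series" key at all.
        for series in result.get("series", []):
            for values in series["values"]:
                row = dict(zip(series["columns"], values))
                row.update(series.get("tags", {}))  # present with GROUP BY tags
                yield row


# Made-up response body, shaped like what /query returns in 1.x
sample = '''{"results": [{"statement_id": 0, "series": [
  {"name": "pc_monitor", "columns": ["time", "last"],
   "values": [["2020-09-13T12:26:40Z", 88]]}]}]}'''

latest = next(rows_from_response(sample), None)
print(latest)  # {'time': '2020-09-13T12:26:40Z', 'last': 88}
```

Using next() with a default of None means an empty result (e.g. nothing within the time limit) gives None instead of an exception.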




Dealing with null



See also:

Management notes

Deleting data

If this is about removing too-old data, the never-think-about-it approach is to set up retention policies.


...but yes, you can do things like:

DROP MEASUREMENT "net"

Notes:

drops all data and series from a measurement


DROP SERIES FROM "net" WHERE hostname='myhostname'

Notes:

drops all series that apply


DELETE FROM "net_monitor" WHERE hostname='myhostname' and time < now() - 1h

Notes:

the delete granularity is effectively measurements (not tags) (verify)
this won't delete the series, even if it removes all points


DROP SHARD shardid

Notes:

you'd probably get the shard id from show shards


https://docs.influxdata.com/influxdb/v1.5/query_language/database_management/#delete-series-with-delete

https://github.com/influxdata/influxdb/issues/8088

https://community.openhab.org/t/influxdb-clear-old-records/88442/4


Browsing data

Use the CLI, or something like Chronograf or Grafana.

There used to be an interface (this? https://docs.influxdata.com/influxdb/cloud/query-data/execute-queries/data-explorer/ )


CLI example:

> SHOW DATABASES
_internal
foo
> use foo
Using database foo
> show series


Backup and restore

influxd backup -database name -portable backup.data


Storage size

Because of the compression done to older data, and the often-quite-compressible nature of time series, most monitoring needs don't really need to worry about space use.

This of course does scale with the number of counters, and the time resolution of insertion.

For example, in a 70 day test with dozens of counters inserting at intervals between 2 and 300 seconds, space use sawtoothed (because the compression is staged) up by 200MB, from 500ish to 700ish.

Not nothing, and on embedded you probably still want rrdtool, but also nothing to worry about on, say, raspberries or small VPSes.

Chronograf notes

Separate install and binary, so needs to be pointed at an InfluxDB instance


https://docs.influxdata.com/chronograf/v1.8/introduction/installation/