Rate limiting, traffic shaping, and such
Linux traffic shaping notes
Introduction
Linux traffic shaping means throttling network data, either by queueing packets or by dropping the occasional packet (the latter makes TCP back off, and is actually a decent way to keep shaping from affecting latency).
The goal is often to get the highest throughput with the lowest latency. (Standard caveat: this only makes sense if you control the actual bottleneck. If you don't, you either won't affect the real bottleneck at all, or can only do so by overshooting and hoping you were the most important influence.)
You configure a tree (per interface) of behaviours.
Traffic enters this tree at the root, and is routed according to filtering until it ends up at a leaf of the tree.
a filter
- must have a classifier, which is used to select packets (and typically route them to a specific qdisc).
- may have a policer, which can do things like 'if exceeding this rate, drop a packet'. (Seems rarely used, presumably too heavy-handed?)
There are a few classifier types:
- u32 is the most common (can inspect all of a packet)
- fw is useful when you use iptables to mark the packet's metadata
- tcindex is useful only for DSMARK
- others are route, rsvp, rsvp6
Tree elements are often configured with things like:
- a capacity, which acts as a hard limit for the element
- a user-set bandwidth that this element should target,
- a particular resolution of control - long-term allows bursting but will make longer transfers oscillate between fast and slow, while short-term is a harder, more immediate limit
- options that control e.g. whether elements may borrow any unused leftovers from sibling nodes
You could, for example, think up the following:
                       capacity    target speed
  + root
  |--+ local:          100Mbit     no limiting
  |  `--- nfs:          80Mbit     no bursting   (NFS is known for congesting networks, so avoid that)
  `--+ inet:             1Mbit
     |--- voip:          1Mbit     no limiting
     |--- http:        320Kbit     30KByte/s
     |--- ftp:         320Kbit     20KByte/s
     `--- fileshare:   120Kbit     15KByte/s, no bursting
I intentionally mixed Kbit(/s) and KByte(/s). I'm pointing out you should be aware of this to avoid making mistakes :)
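A tree like that could be sketched with htb in plain tc. All handles (1:, 1:10, ...) and the filter are illustrative choices of mine, not canonical, and rates are only a subset of the table above:

```shell
# Rough htb sketch of part of the tree above, on eth0 (handles are made up).
tc qdisc add dev eth0 root handle 1: htb default 30

tc class add dev eth0 parent 1:   classid 1:1  htb rate 100mbit
tc class add dev eth0 parent 1:1  classid 1:10 htb rate 80mbit  ceil 80mbit    # nfs: no bursting (ceil = rate)
tc class add dev eth0 parent 1:1  classid 1:20 htb rate 1mbit                  # inet
tc class add dev eth0 parent 1:20 classid 1:21 htb rate 320kbit ceil 1mbit     # http
tc class add dev eth0 parent 1:20 classid 1:22 htb rate 120kbit ceil 120kbit   # fileshare: no bursting

# Route packets into a class, e.g. HTTP responses by source port:
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:21
```

Note that these commands need root, and `tc qdisc del dev eth0 root` undoes the lot while experimenting.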
iptables and tc
Use l7 rules as sparingly as possible. They are complex and slow compared to the rest of the network stack, so they will actually increase latency a little. Still, they can help e.g. guarantee that VoIP always gets preference, something that is hard to do without them.
Some terms
egress means outgoing data
ingress means incoming data
traffic shaping tries to keep bandwidth use within a given rate
traffic policing checks whether bandwidth use is in accordance with a traffic contract
- usually that of a single link, because those are the only ones for which you can negotiate a contract
qdisc - queueing discipline, an "insert specific scheduler algorithm here" thing
- The choice of algorithm is important if you want some specific behaviour
- classful/classless is about configurability:
- classless qdiscs - those whose behaviour is hardcoded and not configurable beyond basic rate figures. (They may well have multiple flows and be somewhat smart, but since you cannot add filters, classes, or child qdiscs, they are leaves in the tree and tend to serve a fairly simple final purpose.)
- classful qdiscs are used to create the major logic in the tree. Filters appear only inside classful qdiscs and handle the classification of packets into one of the qdisc's classes.
Classes are the configurable inner workings that handle different flows of data. Each such class may have some properties according to the qdisc it is in (e.g., individually throttlable flows), and also contains a child qdisc.
Each class may be
- isolated, meaning it will not (try to borrow or) give away bandwidth to sibling classes
- bounded, meaning it will not borrow bandwidth from sibling classes
bursting means you may exceed the limit on the short term. Usually indicates that a qdisc observes the target rate in the long rather than the short term.
Common parameters:(verify)
- rate roughly means the targeted rate. Under differently taxed conditions and depending on your design, this can act as a cap and a rough guarantee.
- ceil caps the rate under borrowing conditions (value seems to default to parent ceil)
- burst caps the amount of bytes bursted at 'ceil' speed (per jiffy - note the tbf-like limit this imposes; you may have to raise this under some conditions)
- cburst caps the amount of bytes bursted as fast as the interface can handle it. Setting this high lowers equalizing properties.
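As a sketch of how rate and ceil combine in practice (device, handles, and figures are all made up for illustration):

```shell
# Two sibling htb classes under a 1mbit parent.
tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:1 htb rate 1mbit

# ceil > rate: may borrow up to the parent's rate when siblings are idle
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 600kbit ceil 1mbit burst 15k

# ceil == rate: bounded, never borrows
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 400kbit ceil 400kbit
```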
Qdisc types
Qdiscs serve varying purposes. They have man pages, see apropos tc- for a list
Classless qdiscs
- pfifo_fast
- Used for ToS purposes: the (sixteen) ToS values are mapped to three bands (where a band is basically a FIFO), namely interactive (0), inbetween (1), and bulk (2).
- When there are packets waiting on more than one band, packets from lower-numbered bands are always emitted first.
- if ToS fields are not used, this acts entirely like pfifo
- so you may want to use firewall rules to mark stuff
- bfifo - Byte-based FIFO
- pfifo - packet-based FIFO
- Usually the default qdiscs in classful qdiscs(verify)
- Mentioning 'fifo' in a configuration seems to mean bfifo(verify)
- sfq
- 'Stochastic Fair Queuing' equalizes (does not limit) individual flows (mostly TCP connections and UDP streams) through cheap-and-decent round-robining, meaning fairness when the qdisc is saturated.
- The 'perturb' option lets you specify how often it should change the details of the hash, which makes it more likely that distribution works well in the long run.
- can be useful as a leaf for paths where you anticipate many concurrent connections
- esfq
- a variation of sfq that allows more control over the band decision process (and over some other details)
- tbf
- Token Bucket Filter is useful when you need a simple hard bandwidth cutoff, with no real bursting allowed either.
- The mentioned tokens are virtual 'you may send out a packet now' tokens that are replenished at a configured rate (and with a maximum amount).
- Apparently uses one token (i.e. sends one packet) per jiffy (where that jiffy depends on the kernel configuration, often 1/100, 1/1000 or 1/125 of a second)(verify)
- Note that that kernel rate both controls the actual meaning of the outgoing rate, and implies a maximum speed (sometimes rather lower than the speed an interface is capable of)
- not afraid to drop packets (depending on the limit parameter)
- gred, "Generic Random Early Drop"
- Apparently more for backbones, and not important to home networks
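Since classless qdiscs take no filters or classes, attaching one is a one-liner. Two hedged examples (device and figures are illustrative):

```shell
# tbf as a simple hard cap: rate plus a burst bucket, and a latency bound
# after which queued packets are dropped.
tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms

# or sfq to equalize concurrent flows without limiting overall rate:
tc qdisc replace dev eth0 root sfq perturb 10
```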
Classful
- prio
- Same idea as pfifo_fast, but you can configure the bands and TOS mapping yourself.
- cbq
- 'Class Based Queuing' is a featured averaging shaper that keeps to the limit set over a longer period of time.
- htb
- Hierarchical Token Bucket is one of the more controllable methods
- resembles tbf but in a simple hierarchy in which the parent can give tokens to the children.
- (seems to queue a bit before it drops packets(verify) - more than tbf or hfsc, anyway)
- hfsc
- Hierarchical Fair Service Curve can be seen as a fancier HTB
- It classifies, has limits (and will drop packets to enforce them), has more control over sharing, and also allows "this has priority over everything else" rules (realtime).
- Well-tuned, this lets you keep interactive latencies low while using most of your bandwidth
- dsmark
- refers to Differentiated Services[1], a field in the IP header
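Classful qdiscs become trees by attaching child qdiscs to their classes. A sketch with prio, which creates its three bands by default (all handles are my own illustrative choices):

```shell
# prio root, then hang leaf qdiscs off its three bands (1:1 .. 1:3):
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:1 handle 10: sfq     # interactive band
tc qdisc add dev eth0 parent 1:2 handle 20: sfq     # inbetween band
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo   # bulk band
```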
Notes on hfsc
See also:
- http://manpages.ubuntu.com/manpages/precise/man7/tc-hfsc.7.html
- https://gist.github.com/bradoaks/940616
- http://manpages.ubuntu.com/manpages/raring/man8/tc-hfsc.8.html
- http://unix.stackexchange.com/questions/96494/about-hfsc-parameters
Related tools
monitoring
bmon is aware of qdiscs and can help visualize which ones are routing your traffic.
A quick and dirty solution:
# to watch the status of the classes:
watch -n 1 -d tc -s class show dev eth0

# and to see the qdisc structure:
watch -n 1 -d tc -s qdisc show dev eth0
Notes:
- the dev eth0 part restricts output to a single device instead of showing all of them.
- ls is equivalent to and less typing than show
- running tc -s is also a quick way to see whether you have tc support at all. You'll likely see a pfifo_fast (per device) when you do.
tc
You can write rules for the tc command, which comes from iproute2 and is little more than a hook to kernel calls.
units
Example
Leaves the local net unthrottled, limits HTTP only somewhat, limits SSH more (because of SCP and SFTP), and throttles everything else down pretty hard.
This assumes your upstream is ~100KByte/s. We intentionally stay a little under that.
This setup will affect every service you don't list here, so in practice you may want to write it the other way around: specifically slow down anything you don't want to be disruptive, and have the catch-all rule be barely throttled or unthrottled.
dev "eth0" {
    egress {
        class ( <$local> ) if ip_dst:16 == 192.168.0.0;
        class ( <$fast> )  if tcp_sport == 80;
        class ( <$med> )   if tcp_sport == 22 || tcp_dport == 22;
        class ( <$other> ) if 1;

        htb () {
            class ( rate 100Mbps, ceil 100Mbps ) {
                $local = class { sfq; }
                $fast  = class ( rate 80kBps, ceil 80kBps ) { sfq; }
                $med   = class ( rate 40kBps, ceil 40kBps ) { sfq; }
                $other = class ( rate 15kBps, ceil 20kBps ) { sfq; }
            }
        }
    }
}
Notes:
- Options like rate and ceil inherit by default (though in this case they are explicitly overridden in all but the $local class)
- Unit specification is NOT exactly the same as in tc(verify)
- Case matters(verify) - those are megabits (b) per second and kilobytes (B) per second.
Since 'bit' also exists, you probably want to avoid abbreviations like '8Mb', since it is more confusable than the equivalent '8Mbit.'
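Since tc's statistics report byte counts while most rate options are given in bits, a quick conversion helps when sanity-checking. A minimal sketch, assuming SI prefixes (1M = 1000*1000), which is worth verifying against your tc version:

```shell
# Convert an 8Mbit/s rate figure to bytes per second, to compare against
# the byte counters that `tc -s` prints.
rate_mbit=8
bytes_per_sec=$(( rate_mbit * 1000 * 1000 / 8 ))
echo "${rate_mbit}Mbit/s = ${bytes_per_sec} Bytes/s"
```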
See also [2]
Semi-sorted
MARK
You can use netfilter MARK values for shaping, but only indirectly, in that you can't use them in shaping rule tests.
Instead, you have to see which qdisc handle you want something sent to, and use that number as the mark in your firewall. I am not yet sure how handle and flowid relate to this, or how to decide the mark for colon'd handles (like 1:10).
Yes, this means that if you use this, you have to rewrite the mangle table every time you structurally change your traffic shaping setup.
Since tcng allows U32 type checking, you can usually avoid it. Cases in which you can't include traffic that is NATted by the same machine, since it arrives at shaping after it has been translated.
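With plain tc (rather than tcng), the mark-and-match side of this looks roughly like the following. The mark value, ports, and class handle are all made up for illustration:

```shell
# Mark outgoing SSH in the mangle table (after NAT has happened)...
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --dport 22 -j MARK --set-mark 6

# ...then match that mark with the fw classifier. 'handle 6 fw' must agree
# with the --set-mark value, and flowid 1:30 with your class structure.
tc filter add dev eth0 parent 1: protocol ip prio 1 handle 6 fw flowid 1:30
```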
To check whether the mangle rule seems to be working, or at least getting data, try:
watch -n 1 -d iptables -t mangle -vL