Rate limiting, traffic shaping, and such



Linux traffic shaping notes

Introduction

Linux traffic shaping means throttling network data, either by queueing packets or by dropping the occasional one (the latter makes TCP back off, which is actually a decent way to throttle without adding latency).

The goal is often to get the highest throughput with the lowest latency. (Standard caveat: this only makes sense if you control the thing that is actually bottlenecked. If you don't, you either have no effect on the real bottleneck, or can only influence it via overkill and hoping you were the most important influence.)


You configure a tree (per interface) of behaviours.

Traffic enters this tree at the root, and is routed according to filtering until it ends up at a leaf of the tree.

a filter:

must have a classifier, which is used to select packets (and typically route them to a specific qdisc/class).
may have a policer, which can do things like 'if this rate is exceeded, drop packets'. (Seems rarely used, presumably because it's too heavy-handed?)

There are a few classifier types:

u32 is the most common (can inspect all of a packet)
fw is useful when you use iptables to mark the packet's metadata
tcindex is useful only for DSMARK
others are route, rsvp, rsvp6
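
As a rough sketch of how the two most common ones are used (assuming an already-existing htb root with handle 1: and classes 1:10 and 1:20 on eth0 - the names, ports, and mark value here are purely illustrative):

# u32: match on packet contents, here TCP source port 80, and send it to class 1:10
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:10

# fw: match on a netfilter mark set earlier, and send it to class 1:20
iptables -t mangle -A OUTPUT -p udp --dport 5060 -j MARK --set-mark 20
tc filter add dev eth0 parent 1: protocol ip prio 2 handle 20 fw flowid 1:20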






Tree elements are often configured with things like:

  • a capacity, which acts as a hard limit for the element
  • a user-set bandwidth that this element should target,
  • a particular resolution of control - long-term allows bursting but will make longer transfers oscillate between fast and slow, while short-term is a harder, more immediate limit
  • options that control e.g. whether elements may borrow any unused leftovers from sibling nodes


You could, for example, think up the following:

                   capacity    target speed
+ root             
|--+ local:         100Mbit    no limiting
|  `--- nfs          80Mbit    no bursting  (NFS is known for congesting networks, so avoid that)
`--+ inet:            1Mbit    
   |--- voip:         1Mbit    no limiting
   |--- http:       320Kbit    30KByte/s 
   |--- ftp:        320Kbit    20KByte/s
   `--- fileshare:  120Kbit    15KByte/s, no bursting

Note that I intentionally mixed Kbit(/s) and KByte(/s) above; keep an eye on which unit you're using, it's an easy way to make mistakes :)
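
To make the structure concrete, a minimal sketch in plain tc of roughly the 'inet' branch above might look like the following (the interface, handles, and exact rates are made up for illustration, and you would still need filters to classify traffic into these classes):

# root qdisc; unclassified traffic falls through to the catch-all class 1:40
tc qdisc add dev eth0 root handle 1: htb default 40

# parent class at roughly the uplink capacity
tc class add dev eth0 parent 1:  classid 1:1  htb rate 1mbit ceil 1mbit

# children: voip, http, ftp, fileshare
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 512kbit ceil 1mbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 240kbit ceil 320kbit
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 160kbit ceil 320kbit
tc class add dev eth0 parent 1:1 classid 1:40 htb rate 120kbit ceil 120kbit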


iptables and tc

Use l7 (layer 7) rules as sparingly as possible. They are complex and slow compared to the rest of the network stack, so they will actually increase latency a little. Still, they can help e.g. guarantee that VoIP always gets preference, something that is hard to do without them.


Some terms

egress means outgoing data

ingress means incoming data


traffic shaping tries to squeeze traffic into the available bandwidth (by queueing/delaying it)

traffic policing checks whether bandwidth use is in accordance with a traffic contract

usually that of a single link, because those are the only ones for which you can negotiate a contract
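
For the policing case, a hedged sketch using the ingress qdisc (which can only measure and drop, not queue; the interface and numbers are illustrative):

# attach the special ingress qdisc
tc qdisc add dev eth0 handle ffff: ingress

# drop whatever part of incoming IP traffic exceeds roughly 1mbit
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 police rate 1mbit burst 100k drop flowid :1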


qdisc - queueing discipline, an "insert specific scheduler algorithm here" thing

The choice of algorithm is important if you want some specific behaviour
classful/classless is about configurability:
classless qdiscs are those whose behaviour is hardcoded and not configurable beyond basic rate figures. (They may well handle multiple flows and be somewhat smart, but since you cannot add filters or classes, or attach child qdiscs, they are leaves in the tree and tend to serve a fairly simple final purpose.)
classful qdiscs are used to create the major logic in the tree. Filters appear only inside classful qdiscs and handle the classification of packets into one of the qdisc's classes.


Classes are the configurable inner workings that handle different flows of data. Each such class may have some properties according to the qdisc it is in (e.g., individually throttlable flows), and also contains a child qdisc.

Each class may be

isolated, meaning it will not (try to borrow or) give away bandwidth to sibling classes
bounded, meaning it will not borrow bandwidth from sibling classes


bursting means the limit may be exceeded in the short term. It usually indicates that a qdisc observes the target rate in the long rather than the short term.


Common parameters:(verify)

  • rate roughly means the targeted rate. Depending on how loaded the tree is and on your design, this can act both as a cap and as a rough guarantee.
  • ceil caps the rate under borrowing conditions (the value seems to default to the parent's ceil)
  • burst caps the number of bytes bursted at 'ceil' speed (per jiffy - note the tbf-like limit this imposes; you may have to raise this under some conditions)
  • cburst caps the number of bytes bursted as fast as the interface can handle. Setting this high lowers the equalizing properties.
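
In tc terms these show up as, for example (an illustrative htb class; the device, classids, and numbers are made up, not a recommendation):

# aim for 40kbit, allow borrowing up to 80kbit, with modest burst allowances
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 40kbit ceil 80kbit burst 6k cburst 3k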

Qdisc types

Qdiscs serve varying purposes. They have man pages; see apropos tc- for a list.

Classless qdiscs

  • pfifo_fast
Used for ToS purposes: the (sixteen) ToS values are mapped to three bands (where a band is basically a FIFO), namely interactive (0), in-between (1), and bulk (2).
When there are packets waiting in more than one band, packets from lower-numbered bands are always emitted first.
If ToS fields are not used, this acts entirely like pfifo,
so you may want firewall rules that set ToS / mark things accordingly.
  • bfifo - Byte-based FIFO
  • pfifo - packet-based FIFO
Usually the default qdiscs in classful qdiscs(verify)
Mentioning 'fifo' in a configuration seems to mean bfifo(verify)


  • sfq
'Stochastic Fair Queuing' equalizes (does not limit) individual flows (mostly TCP connections and UDP streams) through cheap-and-decent round-robining, meaning fairness when the qdisc is saturated.
The 'perturb' option lets you specify how often it should change the details of the hash, which makes it more likely that distribution works well in the long run.
can be useful as a leaf for paths where you anticipate many concurrent connections
  • esfq
a variation of sfq that allows more control over the band decision process (and over some other details)
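
Attaching sfq as such a leaf is a one-liner; a sketch, assuming a class 1:20 on eth0 already exists:

# re-hash flows every 10 seconds so hash collisions don't persistently favour one flow
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10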


  • tbf
Token Bucket Filter is useful when you need a simple hard bandwidth cutoff, and want to allow no real bursting either.
The mentioned tokens are virtual 'you may send out a packet now' tokens that are replenished at a configured rate (and with a maximum amount).
Apparently uses one token (i.e. sends one packet) per jiffy (where that jiffy depends on the kernel configuration, often 1/100, 1/1000 or 1/125 of a second)(verify)
Note that this kernel rate both controls the actual meaning of the outgoing rate, and implies a maximum speed (sometimes rather lower than the speed an interface is capable of)
not afraid to drop packets (depending on the limit parameter)
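
A minimal sketch of tbf as a blunt cap on a whole interface (interface, rate, and sizes are illustrative; latency bounds how long packets may sit queued before being dropped):

tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 70ms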


Apparently more for backbones, and not important to home networks

Classful

  • prio
Same idea as pfifo_fast, but you can configure the bands and ToS mapping yourself (a small sketch follows below).
  • cbq
'Class Based Queuing' is a featureful averaging shaper that keeps to the set limit over a longer period of time.
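
For prio, a sketch of the usual pattern (made-up interface and handles): the qdisc creates classes 1:1 through 1:3 by itself, and you hang a qdisc under each band.

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:1 handle 10: sfq
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
tc qdisc add dev eth0 parent 1:3 handle 30: sfq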


  • htb
Hierarchical Token Bucket is one of the more controllable methods
resembles tbf but in a simple hierarchy in which the parent can give tokens to the children.
(seems to queue a bit before it drops packets(verify) - more than tbf and hfsc, anyway)
  • hfsc
Hierarchical Fair Service Curve can be seen as a fancier HTB
It classifies, has limits (and will drop packets to enforce them), gives more control over sharing, and also allows "this has priority over everything else" rules (realtime).
Well tuned, this lets you keep interactive latencies low while using most of your bandwidth. (A small sketch follows under 'Notes on hfsc' below.)


  • dsmark
refers to Differentiated Services[1], a field in the IP header

Notes on hfsc
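
A hedged sketch of the basic shape (made-up interface, classids, and rates): rt gives a hard realtime guarantee, ls controls how leftover bandwidth is shared, and ul caps borrowing.

# root and an overall cap
tc qdisc add dev eth0 root handle 1: hfsc default 20
tc class add dev eth0 parent 1:  classid 1:1  hfsc sc rate 1mbit ul rate 1mbit

# voip: guaranteed 200kbit regardless of what else is going on
tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt rate 200kbit

# everything else: link-sharing, capped at the uplink
tc class add dev eth0 parent 1:1 classid 1:20 hfsc ls rate 800kbit ul rate 1mbit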


Related tools

monitoring

bmon is aware of qdiscs and can help visualize which ones are routing your traffic.


A quick and dirty solution:

#to watch the status of the classes:
watch -n 1 -d tc -s class show dev eth0

#and to see the qdisc structure:
watch -n 1 -d tc -s qdisc show dev eth0

Notes:

  • the dev eth0 part restricts output to that one device instead of showing all of them.
  • ls is equivalent to show, and less typing
  • running tc -s qdisc is also a quick way to see whether you have tc support at all. You'll likely see a pfifo_fast (per device) when you do.
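
If you also want to watch which filters are attached (same idea as the commands above):

watch -n 1 -d tc -s filter show dev eth0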

tc

You can write rules for the tc command, which comes from iproute2 and is little more than a hook to kernel calls.


units

Example

Leaves the local net unthrottled, limits HTTP only somewhat, limits SSH more (because of SCP and SFTP), and throttles everything else down pretty hard.

This assumes your upstream is ~100KByte/s. We intentionally stay a little under that.

This setup will affect every service you don't list here, so in practice you may want to write it the other way around: specifically slow down anything you don't want disrupting other traffic, and have the catch-all rule be barely throttled or unthrottled.

dev "eth0" {
    egress {
        class ( <$local>  )  if ip_dst:16 == 192.168.0.0;
        class ( <$fast>   )  if tcp_sport == 80;
        class ( <$med>    )  if tcp_sport == 22 || tcp_dport == 22;
        class ( <$other>  )  if 1;
        htb () {
            class ( rate 100Mbps, ceil 100Mbps ) {
                $local  = class                              {sfq;}
                $fast   = class ( rate 80kBps, ceil 80kBps ) {sfq;}
                $med    = class ( rate 40kBps, ceil 40kBps ) {sfq;}
                $other  = class ( rate 15kBps, ceil 20kBps ) {sfq;}
            }
        }
    }
}

Notes:

  • Options like rate and ceil inherit by default (though in this case they are explicitly overridden in all but the $local class)
  • Unit specification is NOT exactly the same as in tc(verify)
Case matters(verify) - those are megabits (b) per second and kilobytes (B) per second.

Since 'bit' is also a valid unit, you probably want to avoid abbreviations like '8Mb'; the equivalent '8Mbit' is less confusable.

See also [2]

Semi-sorted

MARK

You can use netfilter MARK values for shaping, but only indirectly, in that you can't use them in shaping rule tests.

Instead, you have to see what handle the relevant qdisc/class was given, and use that number as the mark in your firewall. I am not yet sure how handle and flowid relate to this, or how to decide the mark for handles with a colon in them.

Yes, this means that if you use this, you have to rewrite the mangle table every time you structurally change your traffic shaping setup.

Since tcng allows u32-type checks, you can usually avoid this. Cases in which you can't include traffic that is NATted by the same machine, since it arrives at shaping after it has been translated.
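
In plain tc terms the pattern being described looks roughly like this - a sketch assuming an htb tree with handle 1: and a class 1:30 on eth0, with 30 as an arbitrarily chosen mark value:

# firewall side: mark the traffic (here: outgoing ssh/scp/sftp)
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --dport 22 -j MARK --set-mark 30

# shaping side: send packets carrying that mark to class 1:30
tc filter add dev eth0 parent 1: protocol ip prio 1 handle 30 fw flowid 1:30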


To check whether the mangle rule seems to be working, or at least getting data, try:

watch -n 1 -d iptables -t mangle -vL
