Linux Advanced Routing & Traffic Control HOWTO: Queueing Disciplines for Bandwidth Management

Whole document tree

www.fifi.org
    Documentation
        Manpages
        GNU Info
        Debian document tree
        Whole document tree
    Trigance web page
    Public services
    User info
    Mailing lists
    Secure server
    Multilingual usage

Validate HTML
Validate CSS

Whole document tree

Linux Advanced Routing & Traffic Control HOWTO: Queueing Disciplines for Bandwidth Management Next Previous Contents

9. Queueing Disciplines for Bandwidth Management

Now, when I discovered this, it really blew me away. Linux 2.2/2.4 comes with everything to manage bandwidth in ways comparable to high-end dedicated bandwidth management systems.

Linux even goes far beyond what Frame and ATM provide.

9.1 Queues and Queueing Disciplines explained

With queueing we determine the way in which data is sent. It is important to realise that we can only shape data that we transmit.

With the way the internet works, we have no direct control of what people send us. It's a bit like your (physical!) mailbox at home. There is no way you can influence the world to modify the amount of mail they send you, short of contacting everybody.

However, the internet is mostly based on TCP/IP which has a few features that help us. TCP/IP has no way of knowing the capacity of the network between two hosts, so it just starts sending data faster and faster ('slow start') and when packets start getting lost, because there is no room to send them, it will slow down. In fact it is a bit smarter than this, but more about that later.

This is the equivalent of not reading half of your mail, and hoping that people will stop sending it to you. With the difference that it works for the Internet :-)

If you have a router and wish to prevent certain hosts of networks from downloading too fast, you need to do your shaping on the *inner* interface of your router, the one that sends data to your own computers.

9.2 Simple, classless Queueing Disciplines

As said, with queueing disciplines, we change the way data is sent. Classless queueing disciplines are those that, by and large accept data and only reorder, delay or drop it.

These can be used to shape traffic for an entire interface, without any subdivisions. It is vital that you understand this part of queueing before we go on the the classful qdisc-containing-qdiscs!

By far the most widely used discipline is the pfifo_fast queue - this is the default. This also explains why these advanced features are so robust. They are nothing more than 'just another queue'.

Each of these queues has specific strengths and weaknesses. Not all of them may be as well tested.

pfifo_fast

This queue is, as the name says, First In, First Out, which means that no packet receives special treatment. At least, not quite. This queue has 3 so called 'bands'. Within each band, FIFO rules apply. However, as long as there are packets waiting in band 0, band 1 won't be processed. Same goes for band 1 and band 2.

The kernel honors the so called Type of Service flag of packets, and takes care to insert 'minimum delay' packets in band 0.

Do not confuse this classless simple qdisc with the classful PRIO one!

Parameters & usage

bands

Number of bands. Defaults to three. If you change this, also change:

priomap

Determines how packet priorities, as assigned by the kernel, map to bands. Mapping occurs according to the following table, which identical to the one used by the PRIO qdisc:

TC_PRIO..          Num  TOS                        Band
-------------------------------------------------------
BESTEFFORT         0    Maximize Reliablity        1
FILLER             1    Minimize Cost              2
BULK               2    Maximize Throughput (0x8)  2
INTERACTIVE_BULK   4                               2
INTERACTIVE        6    Minimize Delay (0x10)      1
CONTROL            7                               2
                   8                               0
                   9                               0
                  10                               1
                  11                               1 
                  12                               1
                  13                               1
                  14                               1
                  15                               1

FIXME: It is not known what the higher priorities correspond to.

SSH sets TOS to 'Minimize Delay', unless it is doing scp, in which case it sets 'Maximize Throughput'. The numbers in parentheses denote the TOS value as reported by tcpdump and the kernel. If you divide this by two, you get the values mentioned by RFC1349.

The default priomap is reasonable, you probably do not need to change it.

txqueuelen

The length of this queue is gleaned from the interface configuration, which you can see and set with ifconfig and ip. To set the queue length to 10, execute: ifconfig eth0 txqueuelen 10

You can't set this parameter with tc!

Token Bucket Filter

The Token Bucket Filter (TBF) is a simple queue that only passes packets arriving at a rate which is not exceeding some administratively set rate, with the possibility to allow short bursts in excess of this rate.

TBF is very precise, network- and processor friendly. It should be your first choice if you simple want to slow an interface down!

The TBF implementation consists of a buffer (bucket), constantly filled by some virtual pieces of information called tokens, at a specific rate (token rate). The most important parameter of the bucket is its size, that is the number of tokens it can store.

Each arriving token collects one incoming data packet from the data queue and is then deleted from the bucket. Associating this algorithm with the two flows -- token and data, gives us three possible scenarios:

The data arrives in TBF at a rate that's equal to the rate of incoming tokens. In this case each incoming packet has its matching token and passes the queue without delay.
The data arrives in TBF at a rate that's smaller than the token rate. Only a part of the tokens are deleted at output of each data packet that's sent out the queue, so the tokens accumulate, up to the bucket size. The unused tokens can then be used to send data a a speed that's exceeding the standard token rate, in case short data bursts occur.
The data arrives in TBF at a rate bigger than the token rate. This means that the bucket will soon be devoid of tokens, which causes the TBF to throttle itself for a while. This is called an 'overlimit situation'. If packets keep coming in, packets will start to get dropped.

The last scenario is very important, because it allows to administratively shape the bandwidth available to data that's passing the filter.

The accumulation of tokens allows a short burst of overlimit data to be still passed without loss, but any lasting overload will cause packets to be constantly delayed, and then dropped.

Please note that in the actual implementation, tokens correspond to bytes, not packets.

Parameters & usage

Even though you will probably not need to change them, tbf has some knobs available. First the parameters that are always available:

limit or latency

Limit is the number of bytes that can be queued waiting for tokens to become available. You can also specify this the other way around by setting the latency parameter, which specifies the maximum amount of time a packet can sit in the TBF. The latter calculation takes into account the size of the bucket, the rate and possibly the peakrate (if set).

burst/buffer/maxburst

Size of the bucket, in bytes. This is the maximum amount of bytes that tokens can be available for instantaneously. In general, larger shaping rates require a larger buffer. For 10mbit/s on Intel, you need at least 10kbyte buffer if you want to reach your configured rate!

If your buffer is too small, packets may be dropped because more tokens arrive per timer tick than fit in your bucket.

mpu

A zero-sized packet does not use zero bandwidth. For ethernet, no packet uses less than 64 bytes. The Minimum Packet Unit determines the minimal token usage for a packet.

rate

The speedknob. See remarks above about limits!

If the bucket contains tokens and is allowed to empty, by default it does so at infinite speed. If this is unacceptable, use the following parameters:

peakrate

If tokens are available, and packets arrive, they are sent out immediately by default, at 'lightspeed' so to speak. That may not be what you want, especially if you have a large bucket.

The peakrate can be used to specify how quickly the bucket is allowed to be depleted. If doing everything by the book, this is achieved by releasing a packet, and then wait just long enough, and release the next. We calculated our waits so we send just at peakrate.

However, due to de default 10ms timer resolution of Unix, with 10.000 bits average packets, we are limited to 1mbit/s of peakrate!

mtu/minburst

The 1mbit/s peakrate is not very useful if your regular rate is more than that. A higher peakrate is possible by sending out more packets per timertick, which effectively means that we create a second bucket!

This second bucket defaults to a single packet, which is not a bucket at all.

To calculate the maximum possible peakrate, multiply the configured mtu by 100 (or more correctly, HZ, which is 100 on intel, 1024 on Alpha).

Sample configuration

A simple but *very* useful configuration is this:

# tc qdisc add dev ppp0 root tbf rate 220kbit latency 50ms burst 1500

Ok, why is this useful? If you have a networking device with a large queue, like a DSL modem or a cablemodem, and you talk to it over a fast device, like over an ethernet interface, you will find that uploading absolutely destroys interactivity.

This is because uploading will fill the queue in the modem, which is probably *huge* because this helps actually achieving good data throughput uploading. But this is not what you want, you want to have the queue not too big so interactivity remains and you can stil do other stuff while sending data.

The line above slows down sending to a rate that does not lead to a queue in the modem - the queue will be in Linux, where we can control it to a limited size.

Change 220kbit to your uplinks *actual* speed, minus a few percent. If you have a really fast modem, raise 'burst' a bit.

Stochastic Fairness Queueing

Stochastic Fairness Queueing (SFQ) is a simple implementation of the fair queueing algorithms family. It's less accurate than others, but it also requires less calculations while being almost perfectly fair.

The key word in SFQ is conversation (or flow), which mostly corresponds to a TCP session or a UDP stream. Traffic is divided into a pretty large number of FIFO queues, one for each conversation. Traffic is then sent in a round robin fashion, giving each session the chance to send data in turn.

This leads to very fair behaviour and disallows any single conversation from drowning out the rest. SFQ is called 'Stochastic' because it doesn't really allocate a queue for each session, it has an algorithm which divides traffic over a limited number of queues using a hashing algorithm.

Because of the hash, multiple sessions might end up in the same bucket, which would halve each session's chance of sending a packet, thus halving the effective speed available. To prevent this situation from becoming noticeable, SFQ changes its hashing algorithm quite often so that any two colliding sessions will only do so for a small number of seconds.

It is important to note that SFQ is only useful in case your actual outgoing interface is really full! If it isn't then there will be no queue on your linux machine and hence no effect. Later on we will describe how to combine SFQ with other qdiscs to get a best-of-both worlds situation.

Specifically, setting SFQ on the ethernet interface heading to your cablemodem or DSL router is pointless without further shaping!

Parameters & usage

The SFQ is pretty much selftuning:

perturb: Reconfigure hashing once this many seconds. If unset, hash will never be reconfigured. Not recommended. 10 seconds is probably a good value.
quantum: Amount of bytes a stream is allowed to dequeue before the next queue gets a turn. Defaults to 1 maximum sized packet (MTU-sized). Do not set below the MTU!

Sample configuration

If you have a device which has identical link speed as actual available rate, like a phone modem, this configuration will help promote fairness:

# tc qdisc add dev ppp0 root sfq perturb 10
# tc -s -d qdisc ls
qdisc sfq 800c: dev eth0 quantum 1514b limit 128p flows 128/1024 perturb 10sec 
 Sent 4812 bytes 62 pkts (dropped 0, overlimits 0)

The number 800c: is the automatically assigned handle number, limit means that 128 packets can wait in this queue. There are 1024 hashbuckets available for accounting, of which 128 can be active at a time (no more packets fit in the queue!) Once every 10 seconds, the hashes are reconfigured.

9.3 Advice for when to use which queue

Summarizing, these are the simple queues that actually manage traffic by reordering, slowing or dropping packets.

The following tips may help in chosing which queue to use. It mentions some qdiscs described in the 'Advanced & less common queueing disciplines'.

To purely slow down outgoing traffic, use the Token Bucket Filter. Works up to huge bandwidths, if you scale the bucket.
If your link is truly full and you want to make sure that no single session can dominate your outgoing bandwidth, use Stochastical Fairness Queueing.
If you have a big backbone and know what you are doing, consider Random Early Drop (see Advanced chapter).
To 'shape' incoming traffic which you are not forwarding, use the Ingress Policer.
If you *are* forwarding it, use a TBF on the interface you are forwarding the data to.
If you don't want to shape, but only want to see if your interface is so loaded that it has to queue, use the pfifo queue (not pfifo_fast). It lacks internal bands but does account the size of its backlog.

9.4 Classful Queueing Disciplines

Some queueing disciplines can contain other queueing disciplines, which are then suddenly called 'classes'. A class is nothing short of a qdisc, except that it lives within another qdisc. We use the terms 'inner qdisc' , 'sub-qdisc' and 'class' interchangeably.

Classful qdiscs are very useful if you have different kinds of traffic which should have differing treatment. One of the classful qdiscs is called 'CBQ' , 'Class Based Queueing' - it is so widely mentioned that people identify queueing with classes solely with CBQ, but this is not the case.

CBQ is merely the oldest kid on the block - yet it is by far the least useful qdisc and also the most complex one. I advise *against* using it. This may come as something of a shock to many who fell for the 'sendmail effect', which learns us that any complex technology which doesn't come with documentation must be the best available.

More about CBQ and it's alternatives shortly.

Flow within classful qdiscs & classes

When traffic enters a classful qdisc, it needs to be sent to any of the qdiscs within - the classes. To determine what to do with a packet, the so called 'filters' are consulted. It is important to know that the filters are called from within a qdisc, and not the other way around!

The filters attached to that qdisc then return with a decision, and the qdisc uses this to enqueue the packet into one of the classes. These classes don't know that they are part of an outer-qdisc, they act as they normally do: accepting packets on one end and outputting them again when asked.

Besides containing other qdiscs, most classful qdiscs also perform shaping. This is useful to perform both packet reordering (with SFQ, for example) and rate control. You need this in cases where you have a high speed interface (for example, ethernet) to a slower device (a cable modem).

If you were only to run SFQ, nothing would happen, as packets enter & leave your router without delay: the output interface is far faster than your actual link speed. There is no queue to process then.

The qdisc family: roots, handles, siblings and parents

Each interface has a 'root qdisc', by default the earlier mentioned classless pfifo_fast queueing discipline. Each qdisc can be assigned a handle, which can be used by later configuration statements to refer to that qdisc.

These handles consist of two parts, a major number and a minor number. It is habitual to name the root qdisc '1:', which is equal to '1:0'.

Because classful qdiscs can have a lot of children, you should give them their own namespace by giving them a separate major number.

How filters are used to classify traffic

Recapping, a typical hierarchy might look like this:

                    root 1:
                      |
                    _1:1_
                   /  |  \
                  /   |   \
                 /    |    \
               10:   11:   12:
              /   \       /   \
           10:1  10:2   12:1  12:2

But don't let this tree fool you! You should *not* imagine the kernel to be at the apex of the tree and the network below, that is just not the case. Packets get enqueued and dequeued at the root qdisc, which is the only thing the kernel talks to.

A packet might get enqueued in a chain like this:

1: -> 1:1 -> 12: -> 12:2

The packet now resides in a queue in qdisc 12:2. In this example, a filter was attached to each 'node' in the tree, each chosing a branch to take next. This can make sense. However, tnis is also possible:

1: -> 12:2

In this case, a filter attached to the root decided to send the packet directly to 12:2.

How packets are dequeued to the hardware

When the kernel decides that it needs to extract packets to send to the interface, the root qdisc 1: gets a dequeue request, which is passed to 1:1, which is in turn passed to 10:, 11: and 12:, which each query their siblings, and try to dequeue() from them. In this case, the kernel needs to walk the entire tree, because only 12:2 contains a packet.

In short, nested qdiscs ONLY talk to their parent qdiscs, never to an interface. Only the root qdisc gets dequeued by the kernel!

The upshot of this is that sub-qdiscs never get dequeued faster than their parents allow. And this is exactly what we want: this way we can have SFQ as an inner class, which doesn't do any shaping, only reordering, and have a shaping outer class, which does the shaping.

The PRIO qdisc

The PRIO qdisc doesn't actually shape, it only subdivides traffic based on how you configured your filters. You can consider the PRIO qdisc a kind of pfifo_fast on stereoids, whereby each band is a separate qdisc instead of a simple FIFO.

When a packet is enqueued to the PRIO qdisc, a sub-qdisc is chosen based on the filter commands you gave. By default, three pfifo sub-qdiscs are created. These sub-qdiscs are by default pure FIFO queues with no internal structure, but you can replace them by any qdisc you have available.

Whenever a packet needs to be dequeued, class :1 is tried first. Higher classes are only used of lower bands all did not give up a packet.

This queue is very useful in case you want to prioritize certain kinds of traffic without using TOS-flags but using all the power of the tc filters. Because it doesn't actually shape, the same warning as for SFQ holds: either use it only if your physical link is really full or wrap it inside a classful qdisc that does shape.

The last holds for almost all cablemodems and DSL devices.

PRIO parameters & usage

The following parameters are recognized by tc:

bands

Number of bands to create. Each band is in fact a class. If you change this number, you should probably also change the priomap.

priomap

If you do not provide tc filters to classify traffic, the PRIO qdisc looks at the TC_PRIO priority to decide how to enqueue traffic. The kernel assigns each packet a TC_PRIO priority, based on TOS flags or socket options passed by the application.

The TC_PRIO is decided based on the TOS, and mapped as follows:

TC_PRIO..          Num  TOS                        Band
-------------------------------------------------------
BESTEFFORT         0    Maximize Reliablity        1
FILLER             1    Minimize Cost              2
BULK               2    Maximize Throughput (0x8)  2
INTERACTIVE_BULK   4                               2
INTERACTIVE        6    Minimize Delay (0x10)      1
CONTROL            7                               2
                   8                               0
                   9                               0
                  10                               1
                  11                               1 
                  12                               1
                  13                               1
                  14                               1
                  15                               1

FIXME: It is not known what the higher priorities confirm to.

The default values are reasonable, you probably do not need to change them.

The bands are classes, and are called major:1 to major:3 by default, so if your PRIO qdisc is called 12:, tc filter traffic to 12:1 to grant it more priority.

Reiterating, band 0 goes to minor number 1! Band 1 to minor number 2, etc.

Sample configuration

We will create this tree:

     root 1: prio
       /   |   \
     1:1  1:2  1:3
      |    |    |
     10:  20:  30:
     sfq  tbf  sfq
band  0    1    2

Bulk traffic will go to 30:, interactive traffic to 20: or 10:.

Commandlines:

# tc qdisc add dev eth0 root handle 1: prio 
## This *instantly* creates 1:1, 1:2, 1:3
  
# tc qdisc add dev eth0 parent 1:1 handle 10: sfq
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq

Now lets's see what we created:

# tc -s qdisc ls dev eth0 
qdisc sfq 30: quantum 1514b 
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0) 

 qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms 
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0) 

 qdisc sfq 10: quantum 1514b 
 Sent 132 bytes 2 pkts (dropped 0, overlimits 0) 

 qdisc prio 1: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 174 bytes 3 pkts (dropped 0, overlimits 0)

As you can see, band 0 has already had some traffic, and one packet was sent while running this command!

We now do some bulk data transfer with a tool that properly sets TOS flags, and take another look:

# scp tc ahu@10.0.0.11:./
ahu@10.0.0.11's password: 
tc                   100% |*****************************|   353 KB    00:00    
# tc -s qdisc ls dev eth0
qdisc sfq 30: quantum 1514b 
 Sent 384228 bytes 274 pkts (dropped 0, overlimits 0) 

 qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms 
 Sent 2640 bytes 20 pkts (dropped 0, overlimits 0) 

 qdisc sfq 10: quantum 1514b 
 Sent 2230 bytes 31 pkts (dropped 0, overlimits 0) 

 qdisc prio 1: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 389140 bytes 326 pkts (dropped 0, overlimits 0)

As you can see, all traffic went to handle 30:, which is the lowest priority band, just as intended. Now to verify that interactive traffic goes to higher bands, we create some interactive traffic:

# tc -s qdisc ls dev eth0
qdisc sfq 30: quantum 1514b 
 Sent 384228 bytes 274 pkts (dropped 0, overlimits 0) 

 qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms 
 Sent 2640 bytes 20 pkts (dropped 0, overlimits 0) 

 qdisc sfq 10: quantum 1514b 
 Sent 14926 bytes 193 pkts (dropped 0, overlimits 0) 

 qdisc prio 1: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 401836 bytes 488 pkts (dropped 0, overlimits 0)

It worked - all additional traffic has gone to 10:, which is our highest priority qdisc. No traffic was sent to the lowest priority, which previously received our entire scp.

The famous CBQ qdisc

As said before, CBQ is the most complex qdisc available, the most hyped, the least understood and possibly the worst queueing discipline for you in the entire Linux kernel. This is not because the authors are evil or incompetent, far from it, it's just that the CBQ algorithm isn't all that precise and doesn't really match the way Linux works.

Besides being classful, CBQ is also a shaper and it is in that aspect that it really doesn't work very well. It should work like this. If you try to shape a 10mbit/s connection to 1mbit/s, the link should be idle 90% of the time. If it isn't, we need to throttle so that it IS idle 90% of the time.

This is pretty hard to measure, so CBQ also needs to know how big an average packet is going to be, and instead derives the idle time from the number of microseconds that elapse between requests from the hardware layer for more data. Combined, this can be used to approximate how full or empty the link is.

This is rather circumspect and doesn't always arrive at proper results. For example, what is the actual link speed of an interface that is not really able to transmit the full 100mbit/s of data, perhaps because of a badly implemented driver? A PCMCIA network card will also never achieve 100mbit/s because of the way the bus is designed - again, how do we calculate the idle time?

It gets even worse if we consider not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which is huge.

People who have done measurements discover that CBQ is not very accurate and sometimes completely misses the mark.

Besides not being all that good, it also comes with little or no documentation AND with about 20 knobs to tune. In fact, CBQ is so hard to configure that people use scripts to generate the needed commands.

In short - I would advise AGAINST using CBQ if you want to have accurate results and want to understand what you are doing.

CBQ shaping in detail

Because lots of people are using CBQ anyhow, possibly because they don't have anything else available, we will describe it here.

As said before, CBQ works by making sure that the link is idle just long enough to bring down the real bandwidth to the configured rate. To do so, it calculates the time that should pass between average packets.

During operations, the effective idletime is measured using an exponential weighted moving average (EWMA), which considers recent packets to be exponentially more important than past ones. The unix loadaverage is calculated in the same way.

The calculated idle time is substracted from the EWMA measured one, the resulting number is called 'avgidle'. A perfectly loaded link has an avgidle of zero: packets arrive exactly once every calculated interval.

An overloaded link has a negative avgidle and if it gets too negative, CBQ shuts down for a while and is then 'overlimit'.

Conversely, an idle link might amass a huge avgidle, which would then allow infinite bandwidths after a few hours of silence. To prevent this, avgidle is capped at maxidle.

If overlimit, in theory, the CBQ could throttle itself for exactly the amount of time that was calculated to pass between packets, and then pass one packet, and throttle again. But see the 'minburst' parameter below.

These are parameters you can specify in order to configure shaping:

avpkt

Average size of a packet, measured in bytes. Needed for idle time approximation.

bandwidth

The physical bandwidth of your device, also needed for idle time calculations.

cell

The time a packet takes to be transmitted over an ethernet device grows in steps, based on the packet size. An 800 and a 806 size packet may take just as long to send, for example - this sets the granularity. Most often set to '8'. Must be an integral power of two.

maxburst

This number of packets is used to calculate maxidle so that when avgidle is at maxidle, this number of average packets can be burst before avgidle drops to 0. Set it higher to be more tolerant of bursts. You can't set maxidle directly, only via this parameter.

minburst

As mentioned before, CBQ needs to throttle in case of overlimit. The ideal solution is to do so for exactly the calculated idle time, and pass 1 packet. However, Unix kernels generally have a hard time scheduling events shorter than 10ms, so it is better to throttle for a longer period, and then pass minburst packets in one go, and then sleep minburst times longer.

The time to wait is called the offtime. Higher values of minburst lead to more accurate shaping in the long term, but to bigger bursts at millisecond timescales.

minidle

If avgidle is below 0, we are overlimits and need to wait until avgidle will be big enough to send one packet. To prevent a sudden burst from shutting down the link for a prolonged period of time, avgidle is reset to minidle if it gets too low.

Minidle is specified in negative microseconds, so 10 means that avgidle is capped at -10us.

mpu

Mininum packet size - needed because even a zero size packet is padded to 64 bytes on ethernet, and so takes a certain time to transmit. CBQ needs to know this to accurately measure the idle time.

rate

Desired rate of traffic leaving this qdisc - this is the 'speed knob'!

Internally, CBQ has a lot of finetuning. For example, classes which are known not to have data enqueued to them aren't queried. Overlimit classes are penalized by lowering their effective priority. All very smart & complicated.

CBQ classful behaviour

Besides shaping, using the aforementioned idletime approximations, CBQ also acts like the PRIO queue in the sense that classes can have differing priorities and that lower priority numbers will be polled before the higher priority ones.

Each time a packet is requested by the hardware layer to be sent out to the network, a weighted round robin process starts, beginning with the lower priority classes.

These are then grouped and queried if they have data available. If so, it is returned. After a class has been allowed to dequeue a number of bytes, the next class within that priority is tried.

The following parameters control the WRR process:

allot

When the outer cbq is asked for a packet to send out on the interface, it will try all inner qdiscs (classes) in turn, in order of the 'priority' parameter. Each time a class gets its turn, it can only send out a limited amount of data. 'Allot' is the base unit of this amount. See the 'weight' parameter for more information.

prio

The CBQ can also act like the PRIO device. Inner classes with lower priority are tried first and as long as they have traffic, other classes are not polled for traffic.

weight

Weight helps in the Weighted Round Robin process. Each class gets a chance to send in turn. If you have classes with significantly more bandwidth than other classes, it makes sense to allow them to send more data in one round than the others.

A CBQ adds up all weights within a class, and normalizes them, so you can use arbitrary numbers: only the ratios are important. People have been using 'rate/10' as a rule of thumb and it appears to work well. The renormalized weight is multiplied by the 'allot' parameter to determine how much data can be sent in one round.

CBQ parameters that determine link sharing & borrowing

Besides purely limiting certain kinds of traffic, it is also possible to specify which classes can borrow capacity from other classes or, conversely, lend out bandwidth.

Isolated/sharing

A class that is configured with 'isolated' will not lend out bandwidth to sibling classes. Use this if you have competing or mutually-unfriendly agencies on your link who do want to give eachother freebies.

The control program tc also knows about 'sharing', which is the reverse of 'isolated'.

bounded/borrow

A class can also be 'bounded', which means that it will not try to borrow bandwidth from sibling classes. tc also knows about 'borrow', which is the reverse of 'bounded'.

A typical situation might be where you have two agencies on your link which are both 'isolated' and 'bounded', which means that they are really limited to their assigened rate, and also won't allow each other to borrow.

Within such an agency class, there might be other classes which are allowed to swap bandwidth.

Sample configuration

This configuration limits webserver traffic to 5mbit and smtp traffic to 3 mbit, and limits the sum to 5mbit:

# tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit         \ 
  avpkt 1000 cell 8
# tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit  \
  rate 5Mbit weight 0.5Mbit prio 8 allot 1514 cell 8 maxburst 20      \
  avpkt 1000

This part installs the root and the customary 1:0 sub-root. While it is possible to remove the first line, and attach the second line directly to the root, there are some subtleties involved which are avoided if adding this extra layer.

As said before, CBQ requires a *lot* of knobs. All parameters are explained above, however. The corresponding HTB configuration is lots simpler.

# tc class add dev eth0 parent 1:1 classid 10: cbq bandwidth 100Mbit  \
  rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20      \
  avpkt 1000 bounded                                                    
# tc class add dev eth0 parent 1:1 classid 20: cbq bandwidth 100Mbit  \
  rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20      \
  avpkt 1000 bounded

These are our two classes. Note how we scale the weight with the configured rate. Also note that both classes are bounded and won't therefore try to borrow traffic.

# tc qdisc add dev eth0 parent 10: tbf rate 5Mbit buffer 10Kb/8 limit \
  15Kb mtu 1500
# tc qdisc add dev eth0 parent 20: tbf rate 3Mbit buffer 10Kb/8 limit \
  15Kb mtu 1500

Here we install token bucket filters in the two configured subclasses. The /8 corresponds to the cell size we mentioned earlier for CBQ. We create a bucket of 10kbytes of tokens, a maximum 'pre-bucket' backlog of 15kbyte.

# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
  sport 80 0xffff flowid 10:0
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
  sport 25 0xffff flowid 20:0

These commands, attached directly to the root, send traffic to the right qdiscs.

Note that we use 'tc class add' to CREATE classes within a qdisc, but that we use 'tc qdisc add' to actually configure these classes.

You may wonder what happens to traffic that is not classified by any of the two rules. It appears that in this case, data will then be processed within 1:0, and be unlimited. This can be configured in a variety of ways, which I do not understand. Use HTB :-)

If smtp+web together try to exceed the set limit of 5mbit/s, bandwidth will be divided according to the weight parameter, giving 5/8 of traffic to the webserver and 3/8 to the mailserver.

Other CBQ parameters: split & defmap

As said before, a classful qdisc needs to call filters to determine which class a packet will be enqueued to.

Besides calling the filter, CBQ offers other options, defmap & split. This is pretty complicated to understand, and it is not vital. But as this is the only known place where defmap & split are properly explained, I'm doing my best.

As you will often want to filter on the Type of Service field only, a special syntax is provided. Whenever the CBQ needs to figure out where a packet needs to be enqueued, it checks if this node is a 'split node'. If so, one of the sub-qdiscs has indicated that it wishes to receive all packets with a certain configured priority, as might be derived from the TOS field, or socket options set by applications.

The packets' priority bits are or-ed with the defmap field to see if a match exists. In other words, this is a short-hand way of creating a very fast filter, which only matches certain priorities. A defmap of ff (hex) will match everything, a map of 0 nothing. A sample configuration may help make things clearer:

# tc qdisc add dev eth1 root handle 1: cbq bandwidth 10Mbit allot 1514 \
  cell 8 avpkt 1000 mpu 64
 
# tc class add dev eth1 parent 1:0 classid 1:1 cbq bandwidth 10Mbit    \
  rate 10Mbit allot 1514 cell 8 weight 1Mbit prio 8 maxburst 20        \
  avpkt 1000

Standard CBQ preamble. I never get used to the sheer amount of numbers required!

Defmap refers to TC_PRIO bits, which are defined as follows:

TC_PRIO..          Num  Corresponds to TOS
-------------------------------------------------
BESTEFFORT         0    Maximuze Reliablity        
FILLER             1    Minimize Cost              
BULK               2    Maximize Throughput (0x8)  
INTERACTIVE_BULK   4                               
INTERACTIVE        6    Minimize Delay (0x10)      
CONTROL            7

This corresponds to bits, counted from the right. Now the interactive, and the bulk classes:

# tc class add dev eth1 parent 1:1 classid 1:2 cbq bandwidth 10Mbit     \
  rate 1Mbit allot 1514 cell 8 weight 100Kbit prio 3 maxburst 20        \
  avpkt 1000 split 1:0 defmap c0

# tc class add dev eth1 parent 1:1 classid 1:3 cbq bandwidth 10Mbit     \
  rate 8Mbit allot 1514 cell 8 weight 800Kbit prio 7 maxburst 20        \
  avpkt 1000 split 1:0 defmap 3f

The 'split qdisc' is 1:0, which is where the choice will be made. C0 is binary for 11000000, 3F for 00111111, so these two together will match everything. The first class matches bits 7 & 6, and thus corresponds to 'interactive' and 'control' traffic. The second class matches the rest.

Node 1:0 now has a table like this:

priority        send to
0               1:3
1               1:3
2               1:3
3               1:3
4               1:3
5               1:3
6               1:2
7               1:2

For additional fun, you can also pass a 'change mask', which indicates exactly which priorities you wish to change. You only need to use this if you are running 'tc class change'. For example, to add best effort traffic to 1:2, we could run this:

# tc class change dev eth1 classid 1:2 cbq defmap 01/01

The priority map over at 1:0 now looks like this:

priority        send to
0               1:2
1               1:3
2               1:3
3               1:3
4               1:3
5               1:3
6               1:2
7               1:2

FIXME: did not test this, only looked at the source.

Hierarchical Token Bucket

Martin Devera (<devik>) rightly realised that CBQ is complex and does not seem optimized for many typical situations. His Hierarchial approach is well suited for setups where you have a fixed amount of bandwidth which you want to divide for different purposes, giving each purpose a guaranteed bandwidth, with the possibility of specifying how much bandwidth can be borrowed.

HTB works just like CBQ but does not resort to idle time calculations to shape. Instead, it is a classful Token Bucket Filter - hence the name. It has only a few parameters, which are well documented on his site.

As your HTB configuration gets more complex, your configuration scales well. With CBQ it is already complex even in simple cases! HTB is not yet a part of the standard kernel, but it should soon be!

If you are in a position to patch your kernel, by all means use HTB instead of CBQ.

Sample configuration

Functionally almost identical to the CBQ sample configuration above:


# tc qdisc add dev eth0 root handle 1: htb default 30

# tc class add dev eth0 parent 1: classid 1:1 htb rate 5mbit burst 15k

# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:20 htb rate 3mbit ceil 5mbit burst 15k
# tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1kbit ceil 5mbit burst 15k

The author then recommends SFQ for beneath these classes:


# tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
# tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
# tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10

Add the filters which direct traffic to the right classes:


# U32="tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32"
# $U32 match ip dport 80 0xffff flowid 1:10
# $U32 match ip sport 25 0xffff flowid 1:20

And that's it - no unsightly unexplained numbers, no undocumented parameters.

HTB certainly looks wonderful - if 10: and 20: both have their guaranteed bandwidth, and more is left to divide, they borrow in a 5:3 ratio, just as you would expect.

Unclassified traffic gets routed to 30:, which has little bandwidth of its own but can borrow everything that is left over. Because we chose SFQ internally, we get fairness thrown in for free!

9.5 Classifying packets with filters

To determine which class shall process a packet, the so-called 'classifier chain' is called each time a choice needs to be made. This chain consists of all filters attached to the classful qdisc that needs to decide.

To reiterate the tree, which is not a tree:

                    root 1:
                      |
                    _1:1_
                   /  |  \
                  /   |   \
                 /    |    \
               10:   11:   12:
              /   \       /   \
           10:1  10:2   12:1  12:2

When enqueueing a packet, at each branch the filter chain is consulted for a relevant instruction. A typical setup might be to have a filter in 1:1 that directs a packet to 12: and a filter on 12: that sends the packet to 12:2.

You might also attach this latter rule to 1:1, but you can make efficiency gains by having more specific tests lower in the chain.

You can't filter a packet 'upwards', by the way. Also, with HTB, you should attach all filters to the root!

And again - packets are only enqueued downwards! When they are dequeued, they go up again, where the interface lives. They do NOT fall off the end of the tree to the network adaptor!

Some simple filtering examples

As explained in the Classifier chapter, you can match on literally anything, using a very complicated syntax. To start, we will show how to do the obvious things, which luckily are quite easy.

Let's say we have a PRIO qdisc called '10:' which contains three classes, and we want to assign all traffic from and to port 22 to the highest priority band, the filters would be:


# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match \ 
  ip dport 22 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match \
  ip sport 80 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2 flowid 10:2

What does this say? It says: attach to eth0, node 10: a priority 1 u32 filter that matches on IP destination port 22 *exactly* and send it to band 10:1. And it then repeats the same for source port 80. The last command says that anything unmatched so far should go to band 10:2, the next-highest priority.

You need to add 'eth0', or whatever your interface is called, because each interface has a unique namespace of handles.

To select on an IP address, use this:


# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 \ 
  match ip dst 4.3.2.1/32 flowid 10:1
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 \
  match ip src 1.2.3.4/32 flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2      \
  flowid 10:2

This assigns traffic to 4.3.2.1 and traffic from 1.2.3.4 to the highest priority queue, and the rest to the next-highest one.

You can concatenate matches, to match on traffic from 1.2.3.4 and from port 80, do this:


# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip src 4.3.2.1/32
  match ip sport 80 0xffff flowid 10:1

All the filtering commands you will normally need

Most shaping commands presented here start with this preamble:

# tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ..

These are the so called 'u32' matches, which can match on ANY part of a packet.

On source/destination address

Source mask 'match ip src 1.2.3.0/24', destination mask 'match ip dst 4.3.2.0/24'. To match a single host, use /32, or omit the mask.

On source/destination port, all IP protocols

Source: 'match ip sport 80 0xffff', 'match ip dport 0xffff'

On ip protocol (tcp, udp, icmp, gre, ipsec)

Use the numbers from /etc/protocols, for example, icmp is 1: 'match ip protocol 1 0xff'.

On fwmark

You can mark packets with either ipchains and have that mark survive routing across interfaces. This is really useful to for example only shape traffic on eth1 that came in on eth0. Syntax: # tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw classid 1:1 Note that this is not a u32 match!

You can place a mark like this:

# iptables -A FORWARD -t mangle -i eth0 -j MARK --set-mark 6

The number 6 is arbitrary.

If you don't want to understand the full tc filter syntax, just use iptables, and only learn to select on fwmark.

For more filtering commands, see the Advanced Filters chapter.

Next Previous Contents