Linux Advanced Routing & Traffic Control HOWTO: Advanced & less common queueing disciplines

Whole document tree

www.fifi.org
    Documentation
        Manpages
        GNU Info
        Debian document tree
        Whole document tree
    Trigance web page
    Public services
    User info
    Mailing lists
    Secure server
    Multilingual usage

Validate HTML
Validate CSS

Whole document tree

Linux Advanced Routing & Traffic Control HOWTO: Advanced & less common queueing disciplines Next Previous Contents

14. Advanced & less common queueing disciplines

Besides the queues mentioned earlier, the kernel contains some other more specialized queues which are mentioned here, should you find that you have needs not addressed by the other queues.

14.1 bfifo/pfifo

These classless queues are even simpler than pfifo_fast in that they lack the internal bands - all traffic is really equal. They have one important benefit though, they have some statistics. So even if you don't need shaping or prioritizing, you can use this qdisc to determine the backlog on your interface.

pfifo has a length measured in packets, bfifo in bytes.

Parameters & usage

limit: Specifies the length of the queue. Measured in bytes for bfifo, in packets for pfifo. Defaults to the interface txqueuelen (see pfifo_fast chapter) packets long or txqueuelen*mtu bytes for bfifo.

14.2 Clark-Shenker-Zhang algorithm (CSZ)

This is so theoretical that not even Alexey (the main CBQ author) claims to understand it. From his source:

"David D. Clark, Scott Shenker and Lixia Zhang Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism.

As I understand it, the main idea is to create WFQ flows for each guaranteed service and to allocate the rest of bandwith to dummy flow-0. Flow-0 comprises the predictive services and the best effort traffic; it is handled by a priority scheduler with the highest priority band allocated for predictive services, and the rest --- to the best effort packets.

Note that in CSZ flows are NOT limited to their bandwidth. It is supposed that the flow passed admission control at the edge of the QoS network and it doesn't need further shaping. Any attempt to improve the flow or to shape it to a token bucket at intermediate hops will introduce undesired delays and raise jitter.

At the moment CSZ is the only scheduler that provides true guaranteed service. Another schemes (including CBQ) do not provide guaranteed delay and randomize jitter."

Does not currently seem like a good canidate to use, unless you've read and understand the article mentioned.

14.3 DSMARK

Esteve Camps Chust <marvin@grn.es>
This text is an extract from my thesis on "QoS Support in Linux", September 2000.

Source documents:

Draft-almesberger-wajhak-diffserv-linux-01.txt.
Examples in iproute2 distribution.
White Paper-QoS protocols and architectures and IP QoS Frequently Asked Questions both by Quality of Service Forum.

This chapter was written by Esteve Camps <esteve@hades.udg.es>.

Introduction

First of all, first of all, it would be a great idea for you to read RFCs written about this (RFC2474, RFC2475, RFC2597 and RFC2598) at IETF DiffServ working Group web site and Werner Almesberger web site (he wrote the code to support Differentiated Services on Linux).

What is Dsmark related to?

Dsmark is a queueing discipline that offers the capabilities needed in Differentiated Services (also called DiffServ or, simply, DS). DiffServ is one of two actual QoS architectures (the other one is called Integrated Services) that is based on a value carried by packets in the DS field of the IP header.

One of the first solutions in IP designed to offer some QoS level was the Type of Service field (TOS byte) in IP header. By changing that value, we could choose a high/low level of throughput, delay or reliability. But this didn't provide sufficient flexibility to the needs of new services (such as real-time applications, interactive applications and others). After this, new architectures appeared. One of these was DiffServ which kept TOS bits and renamed DS field.

Differentiated Services guidelines

Differentiated Services is group-oriented. I mean, we don't know nothing about flows (this will be the Integrated Services purpose); we know about flow aggregations and we will apply different behaviours depending on which aggregation a packet belongs to.

When a packet arrives to an edge node (entry node to a DiffServ domain) entering to a DiffServ Domain we'll have to policy, shape and/or mark those packets (marking refers to assigning a value to the DS field. It's just like the cows :-) ). This will be the mark/value that the internal/core nodes on our DiffServ Domain will look at to determine which behaviour or QoS level apply.

As you can deduce, Differentiated Services involves a domain on which all DS rules will have to be applied. In fact you can think "I will classify all the packets entering my domain. Once they enter my domain they will be subjected to the rules that my classification dictates and every traversed node will apply that QoS level".

In fact, you can apply your own policies into your local domains, but some Service Level Agreements should be considered when connecting to other DS domains.

At this point, you maybe have a lot of questions. DiffServ is more than I've explained. In fact, you can understand that I can not resume more than 3 RFC's in just 50 lines :-).

Working with Dsmark

As the DiffServ bibliography specifies, we differentiate boundary nodes and interior nodes. These are two important points in the traffic path. Both types perform a classification when the packets arrive. Its result may be used in different places along the DS process before the packet is released to the network. It's just because of this that the diffserv code supplies an structure called sk_buff, including a new field called skb->tc_index where we'll store the result of initial classification that may be used in several points in DS treatment.

The skb->tc_index value will be initially set by the DSMARK qdisc, retrieving it from the DS field in IP header of every received packet. Besides, cls_tcindex classifier will read all or part of skb->tcindex value and use it to select classes.

But, first of all, take a look at DSMARK qdisc command and its parameters:


... dsmark indices INDICES [ default_index DEFAULT_INDEX ] [ set_tc_index ]

What do these parameters mean?

indices: size of table of (mask,value) pairs. Maximum value is 2^n, where n>=0.
Default_index: the default table entry index if classifier finds no match.
Set_tc_index: instructs dsmark discipline to retrieve the DS field and store it onto skb->tc_index.

Let's see the DSMARK process.

How SCH_DSMARK works.

This qdisc will apply the next steps:

If we have declared set_tc_index option in qdisc command, DS field is retrieved and stored onto skb->tc_index variable.
Classifier is invoked. The classifier will be executed and it will return a class ID that will be stored in skb->tc_index variable.If no filter matches are found, we consider the default_index option to be the classId to store. If neither set_tc_index nor default_index has been declared results may be impredictibles.
After been sent to internal qdisc's where you can reuse the result of the filter, the classid returned by the internal qdisc is stored into skb->tc_index. We will use this value in the future to index a mask- value table. The final result to assign to the packet will be that resulting from next operation:
New_Ds_field = ( Old_DS_field & mask ) | value
Thus, new value will result from "anding" ds_field and mask values and next, this result "ORed" with value parameter. See next diagram to understand all this process:


                         skb->ihp->tos
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >
     |                                                       |     ^
     | -- If you declare set_tc_index, we set DS             |     |  <-----May change
     |    value into skb->tc_index variable                  |     |O       DS field
     |                                                      A|     |R
   +-|-+      +------+    +---+-+    Internal   +-+     +---N|-----|----+
   | | |      | tc   |--->|   | |-->  . . .  -->| |     |   D|     |    |
   | | |----->|index |--->|   | |     Qdisc     | |---->|    v     |    |
   | | |      |filter|--->| | | +---------------+ |   ---->(mask,value) |
-->| O |      +------+    +-|-+--------------^----+  /  |  (.  ,  .)    |
   | | |          ^         |                |       |  |  (.  ,  .)    |
   | | +----------|---------|----------------|-------|--+  (.  ,  .)    |
   | | sch_dsmark |         |                |       |                  |
   +-|------------|---------|----------------|-------|------------------+
     |            |         | <- tc_index -> |       |
     |            |(read)   |    may change  |       |  <--------------Index to the
     |            |         |                |       |                    (mask,value)
     v            |         v                v       |                    pairs table
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
                         skb->tc_index

How to do marking? Just change the mask and value of the class you want to remark. See next line of code:

tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8

This changes the (mask,value) pair in hash table, to remark packets belonging to class 1:1.You have to "change" this values because of default values that (mask,value) gets initially (see table below).

Now, we'll explain how TC_INDEX filter works and how fits into this. Besides, TCINDEX filter can be used in other configurations rather than those including DS services.

TC_INDEX Filter

This is the basic command to declare a TC_INDEX filter:


... tcindex [ hash SIZE ] [ mask MASK ] [ shift SHIFT ]
            [ pass_on | fall_through ]
            [ classid CLASSID ] [ police POLICE_SPEC ]

Next, we show the example used to explain TC_INDEX operation mode. Pay attention to bolded words:

tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index
tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2
tc qdisc add dev eth0 parent 1:0 handle 2:0 cbq bandwidth 10Mbit cell 8 avpkt 1000 mpu 64
# EF traffic class
tc class add dev eth0 parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10
# Packet fifo qdisc for EF traffic
tc qdisc add dev eth0 parent 2:1 pfifo limit 5
tc filter add dev eth0 parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on

(This code is not complete. It's just an extract from EFCBQ example included in iproute2 distribution).

First of all, suppose we receive a packet marked as EF . If you read RFC2598, you'll see that DSCP recommended value for EF traffic is 101110. This means that DS field will be 10111000 (remember that less signifiant bits in TOS byte are not used in DS) or 0xb8 in hexadecimal codification.


              TC INDEX
              FILTER
   +---+      +-------+    +---+-+    +------+                +-+    +-------+
   |   |      |       |    |   | |    |FILTER|  +-+    +-+    | |    |       |
   |   |----->| MASK  | -> |   | | -> |HANDLE|->| |    | | -> | | -> |       |
   |   |  .   | =0xfc |    |   | |    |0x2E  |  | +----+ |    | |    |       |
   |   |  .   |       |    |   | |    +------+  +--------+    | |    |       |
   |   |  .   |       |    |   | |                            | |    |       |
-->|   |  .   | SHIFT |    |   | |                            | |    |       |-->
   |   |  .   | =2    |    |   | +----------------------------+ |    |       |
   |   |      |       |    |   |       CBQ 2:0                  |    |       |
   |   |      +-------+    +---+--------------------------------+    |       |
   |   |                                                             |       |
   |   +-------------------------------------------------------------+       |
   |                          DSMARK 1:0                                     |
   +-------------------------------------------------------------------------+

The packet arrives, then, set with 0xb8 value at DS field. As we explained before, dsmark qdisc identified by 1:0 id in the example, retrieves DS field and store it in skb->tc_index variable. Next step in the example will correspond to the filter associated to this qdisc (second line in the example). This will perform next operations:


Value1 = skb->tc_index & MASK
Key = Value1 >> SHIFT

In the example, MASK=0xFC i SHIFT=2.


Value1 = 10111000 & 11111100 = 10111000
Key = 10111000 >> 2 = 00101110 -> 0x2E in hexadecimal

The returned value will correspond to a qdisc interal filter handle (in the example, identifier 2:0). If a filter with this id exists, policing and metering conditions will be verified (in case that filter includes this) and the classid will be returned (in our example, classid 2:1) and stored in skb->tc_index variable.

But if any filter with that identifier is found, the result will depend on fall_through flag declaration. If so, value key is returned as classid. If not, an error is returned and process continues with the rest filters. Be careful if you use fall_through flag; this can be done if a simple relation exists between values
of skb->tc_index variable and class id's.

The latest parameters to comment on are hash and pass_on. The first one relates to hash table size. Pass_on will be used to indicate that if no classid equal to the result of this filter is found, try next filter. The default action is fall_through (look at next table).

Finally, let's see which possible values can be set to all this TCINDEX parameters:


TC Name                 Value           Default
-----------------------------------------------------------------
Hash                    1...0x10000     Implementation dependent
Mask                    0...0xffff      0xffff
Shift                   0...15          0
Fall through / Pass_on  Flag            Fall_through
Classid                 Major:minor     None
Police                  .....           None

This kind of filter is very powerful. It's necessary to explore all possibilities. Besides, this filter is not only used in DiffServ configurations. You can use it as any other kind of filter.

I recommend you to look at all DiffServ examples included in iproute2 distribution. I promise I will try to complement this text as soon as I can. Besides, all I have explained is the result of a lot of tests. I would thank you tell me if I'm wrong in any point.

14.4 Ingress policer qdisc

The Ingress qdisc comes in handy if you need to ratelimit a host without help from routers or other Linux boxes. You can police incoming bandwidth and drop packets when this bandwidth exceeds your desired rate. This can save your host from a SYN flood, for example, and also works to slow down TCP/IP, which responds to dropped packets by reducing speed.

FIXME: shaping by dropping packets seems less desirable than using, for example, a token bucket filter. Not sure though, Cisco CAR works this way, and people appear to be happy with it.

See the reference to IOS Committed Access Rate at the end of this document.

In short: you can use this to limit how fast your computer downloads files, thus leaving more of the available bandwidth for others.

The ingress policer is not a regular qdisc, although it looks like one. It is in fact hooked to the interface via the Netfilter infrastructure. Configuration is achieved using 'tc filter police' commands.

See the section on 'Protecting your host from SYN floods' for an example on how this works.

14.5 Random Early Detection (RED)

This section is meant as an introduction to backbone routing, which often involves <100 megabit bandwidths, which requires a different approach than your ADSL modem at home.

The normal behaviour of router queues on the Internet is called tail-drop. Tail-drop works by queueing up to a certain amount, then dropping all traffic that 'spills over'. This is very unfair, and also leads to retransmit synchronisation. When retransmit synchronisation occurs, the sudden burst of drops from a router that has reached its fill will cause a delayed burst of retransmits, which will over fill the congested router again.

In order to cope with transient congestion on links, backbone routers will often implement large queues. Unfortunately, while these queues are good for throughput, they can substantially increase latency and cause TCP connections to behave very bursty during congestion.

These issues with tail-drop are becoming increasingly troublesome on the Internet because the use of network unfriendly applications is increasing. The Linux kernel offers us RED, short for Random Early Detect, also called Random Early Drop, as that is how it works.

RED isn't a cure-all for this, applications which inappropriately fail to implement exponential backoff still get an unfair share of the bandwidth, however, with RED they do not cause as much harm to the throughput and latency of other connections.

RED statistically drops packets from flows before it reaches its hard limit. This causes a congested backbone link to slow more gracefully, and prevents retransmit synchronisation. This also helps TCP find its 'fair' speed faster by allowing some packets to get dropped sooner keeping queue sizes low and latency under control. The probability of a packet being dropped from a particular connection is proportional to its bandwidth usage rather than the number of packets it transmits.

RED is a good queue for backbones, where you can't afford the complexity of per-session state tracking needed by fairness queueing.

In order to use RED, you must decide on three parameters: Min, Max, and burst. Min sets the minimum queue size in bytes before dropping will begin, Max is a soft maximum that the algorithm will attempt to stay under, and burst sets the maximum number of packets that can 'burst through'.

You should set the min by calculating that highest acceptable base queueing latency you wish, and multiply it by your bandwidth. For instance, on my 64kbit/s ISDN link, I might want a base queueing latency of 200ms so I set min to 1600 bytes. Setting min too small will degrade throughput and too large will degrade latency. Setting a small min is not a replacement for reducing the MTU on a slow link to improve interactive response.

You should make max at least twice min to prevent synchronisation. On slow links with small min's it might be wise to make max perhaps four or more times large then min.

Burst controls how the RED algorithm responds to bursts. Burst must be set larger then min/avpkt. Experimentally, I've found (min+min+max)/(3*avpkt) to work okay.

Additionally, you need to set limit and avpkt. Limit is a safety value, after there are limit bytes in the queue, RED 'turns into' tail-drop. I typical set limit to eight times max. Avpkt should be your average packet size. 1000 works okay on high speed Internet links with a 1500byte MTU.

Read the paper on RED queueing by Sally Floyd and Van Jacobson for technical information.

14.6 Generic Random Early Detection

Not a lot is known about GRED. It looks like GRED with several internal queues, whereby the internal queue is chosen based on the Diffserv tcindex field. According to a slide found here, it contains the capabilities of Cisco's 'Distributed Weighted RED', as well as Dave Clark's RIO.

Each virtual queue can have its own Drop Parameters specified.

FIXME: get Jamal or Werner to tell us more

14.7 VC/ATM emulation

This is quite a major effort by Werner Almesberger to allow you to build Virtual Circuits over TCP/IP sockets. A Virtual Circuit is a concept from ATM network theory.

For more information, see the ATM on Linux homepage.

14.8 Weighted Round Robin (WRR)

This qdisc is not included in the standard kernels but can be downloaded from http://wipl-wrr.dkik.dk/wrr/. Currently the qdisc is only tested with Linux 2.2 kernels but it will probably work with 2.4/2.5 kernels too.

The WRR qdisc distributes bandwidth between its classes using the weighted round robin scheme. That is, like the CBQ qdisc it contains classes into which arbitrary qdiscs can be plugged. All classes which have sufficient demand will get bandwidth proportional to the weights associated with the classes. The weights can be set manually using the tc program. But they can also be made automatically decreasing for classes transferring much data.

The qdisc has a built-in classifier which assigns packets coming from or sent to different machines to different classes. Either the MAC or IP and either source or destination addresses can be used. The MAC address can only be used when the Linux box is acting as an ethernet bridge, however. The classes are automatically assigned to machines based on the packets seen.

The qdisc can be very useful at sites such as dorms where a lot of unrelated individuals share an Internet connection. A set of scripts setting up a relevant behavior for such a site is a central part of the WRR distribution.

Next Previous Contents