[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.

Jim Garlick garlick at llnl.gov
Mon May 18 13:34:03 PDT 2009


On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
> Hi!
> 
> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
> LND allowed you to #define a service level, but I couldn't find a similar
> facility in o2ib. Is there a different way to apply QoS rules?
> 
> Thanks,
> 
> Daniel.

Hi, I don't know much about this stuff, but our IB guys did use QoS
to help us when we found LNET was falling apart as we brought up
our first 1K-node cluster based on quad-socket, quad-core Opterons
and ran MPI collective stress tests on all cores.

Here are some notes they put together - see the "QoS Policy file" section.

Jim
____________________________________
QoS configuration on Infiniband

May 18, 2009

Albert Chu
chu11 at llnl.gov

Overview
--------
Quality of Service (QoS) in Infiniband provides a means to give
certain applications on the fabric bandwidth guarantees or minimum
service requirements.

Definitions
-----------

Virtual Lanes (VLs): Infiniband supports up to 15 Virtual Lanes
(VLs), numbered 0-14, for data traffic (VL15 is reserved for subnet
management).  Each VL provides independent virtual transmit/receive
buffers on every port in the fabric.

Service Level (SL): A number (0-15) that can be assigned to any
Infiniband packet.  Infiniband attaches no inherent meaning to an SL;
its purpose is up to the user to define.

Basic QoS Implementation in Infiniband
--------------------------------------

There are three basic parts to QoS in Infiniband.

1) Assign/configure protocols/tools/applications to use appropriate
   SLs.

   Normally, you assign different SLs to different protocols,
   applications, etc. (e.g. MPI, Lustre).  This allows each
   protocol/application to be given unique QoS requirements.

2) Configure SL2VL mapping

   Map SLs to VLs.  For example, SL0->VL0, SL1->VL1, etc.

3) Configure VL Arbitration

   Determines when each VL may transmit, based on a set of
   prioritization rules.

It is the responsibility of administrators/users to use and configure
the SLs/VLs properly.  SLs and VLs carry no inherent meaning in the
Infiniband hardware itself.
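
Putting the three parts together, the path for a given protocol's
traffic looks like this (a sketch using the Lustre/MPI assignments
developed later in this document):

  application (e.g. Lustre)
      -> SL  (assigned via the QoS policy file, or by the application)
      -> VL  (via the SL2VL mapping)
      -> transmit scheduling  (via the VL Arbitration tables and the
         Limit of High Priority)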

SL2VL Mapping Configuration
---------------------------

This is pretty basic.  You assign each SL to a VL in a direct
one-to-one mapping, e.g. SL1->VL1, SL2->VL2.

Normally, you map SLX -> VLX.  If you do otherwise, you're starting to
do something pretty crazy.
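
For example, an identity mapping for the first four SLs, written in
OpenSM's options-file syntax (the qos_sl2vl option is covered in
"Configuring for OpenSM" below), would look like:

qos_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15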

VL Arbitration Configuration
----------------------------

This is not so basic.  There are three components to VL Arbitration
configuration: the High-Priority Table, the Low-Priority Table, and
the Limit of High Priority.

High/Low VL Arbitration Tables
------------------------------

The High and Low Priority VL Arbitration Tables are each a list of
pairs of a VL number (0-14) and a weighting value (0-255).  The
weighting value indicates the number of 64-byte units that can be
transmitted from that VL when it is that VL's turn to transmit.  A
weight of 0 means no data can be transferred.  Counters are rounded up
as needed for packets (i.e. with a weight of 1, a packet > 64 bytes
can still be sent).  The High Priority VL Arbitration Table holds
weights for "high priority" data, while the Low Priority VL
Arbitration Table holds weights for "low priority" data (the
distinction will make more sense after you read "Limit of High
Priority" below).

Note that 64*255 =~ 16K, which is a small amount of data for many
institutions.  I think it is easiest to think of the weights as ratios
giving each VL's percentage of bandwidth when the network is
completely flooded with data from all protocols/applications.

For example:

A) VL0 Weight = 255, VL1 Weight = 255 

   50% bandwidth for VL0 and VL1 each.

B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255 

   33% bandwidth for VL0, VL1, and VL2 each.

C) VL0 Weight = 200, VL1 Weight = 100 

   66% bandwidth for VL0, 33% bandwidth for VL1.

D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100 

   50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.
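
In OpenSM's options-file syntax (described in "Configuring for OpenSM"
below), each table is given as a comma-separated list of <VL>:<Weight>
pairs.  As a sketch, example C above, placed in the low-priority
table, would be written as:

qos_vlarb_low 0:200,1:100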

Limit of High Priority
----------------------

Indicates how much high-priority data (from the High VL Arbitration
Table) can be sent without an opportunity to send a low-priority
packet (from the Low VL Arbitration Table).  The limit counts in
4K-byte increments, with two special values: 0 = one packet, and
255 = unlimited data.

4K*254 =~ 1M, which again is a small amount of data for many
institutions.  The most likely values to consider using are:

0   - one packet
254 - maximum high-priority data without being unlimited
255 - unlimited data
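
In OpenSM's options file this value is set with the qos_high_limit
option (see "Configuring for OpenSM" below), e.g.:

qos_high_limit 254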

VL Arbitration Examples
-----------------------

When you combine the High/Low VL Arbitration tables with the Limit of
High Priority, you can create some interesting QoS behavior.

Example 1:

(The following example is borrowed from the "Quality and Service in
OFED 3.1" presentation listed below.)

High-Limit: 0
VL-Arb-High: VL2 Weight = 1
VL-Arb-Low: VL0 Weight = 200, VL1 Weight = 50

Effectively, any time data is available on VL2, at most one packet is
sent from VL2 before data is sent from VL0 or VL1.  If no VL2 data is
available, VL0 gets 80% of the bandwidth and VL1 gets 20%.

Idea: 

(Assume Lustre Meta Data Servers and Lustre OSTs are on the same
fabric)

MPI -> SL0 -> VL0
Lustre OST Data -> SL1 -> VL1
Lustre Meta Data -> SL2 -> VL2

In this example, Lustre meta data traffic is assumed to be light, but
with the high priority it is serviced faster, theoretically allowing
for better Lustre interaction.  When there is no Lustre meta data
traffic on the fabric, MPI is given the majority share of bandwidth
because it is more timing sensitive.
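
As a sketch, Example 1 could be written in OpenSM's options-file
syntax (described in "Configuring for OpenSM" below; the qos_sl2vl
line assumes the SL assignments from the Idea above):

qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low  0:200,1:50
qos_sl2vl      0,1,2,15,15,15,15,15,15,15,15,15,15,15,15,15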

Example 2:

High-Limit: 254
VL-Arb-High: VL0 Weight = 255
VL-Arb-Low: VL1 Weight = 1

Effectively, whenever there is data on VL0, always send it before VL1,
but do not allow VL0 to starve VL1: let VL1 send *something* once in
a while.

Idea: 

MPI -> SL0 -> VL0
Lustre -> SL1 -> VL1

So MPI always gets priority over Lustre, but cannot starve it out.
The High-Limit of 254 means a low-priority packet must be sent once in
a while.  This could be important if Lustre "pings" are done to keep
some services alive.  (This is essentially the channel adapter policy
configured in the OpenSM example below.)

Configuring for OpenSM
----------------------

Currently this is configured in /var/cache/opensm/opensm.opts (in
later OpenSM versions, /etc/opensm/opensm.conf).

#
# QoS OPTIONS
#
qos TRUE

qos_policy_file /var/cache/opensm/qos-policy.conf

# QoS default options
qos_max_vls 2
qos_high_limit 254
qos_vlarb_high 0:255
qos_vlarb_low 1:1
qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

qos_ca_max_vls 2
qos_ca_high_limit 254
qos_ca_vlarb_high 0:255
qos_ca_vlarb_low 1:1
qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

# achu: VL2 not used, need to give non-null input to buggy opensm
qos_swe_max_vls 2
qos_swe_high_limit 255
qos_swe_vlarb_high 0:225,1:25
qos_swe_vlarb_low 2:1
qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

Notes/Comments:

There are default QoS options, and specific QoS options for channel
adapters (qos_ca_*), switch external ports (qos_swe_*), etc.  They
allow you to configure the different port types across the fabric
differently.

The "max_vls" entries can be ignored.

The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
self exaplanatory.  The "vlarb_high"/"vlarb_low" entries take inputs
as <VL>:<Weight> as input.

In the above example, channel adapters have:

VL0 Weight = 255 -> For MPI

VL1 Weight = 1 -> For Lustre

Idea: With the High Limit of 254, MPI always gets priority, but cannot
starve Lustre.

In the above example, Switches have:

VL0 Weight = 225 -> For MPI
VL1 Weight = 25 -> For Lustre

Idea: Across the entire cluster, MPI, Lustre, and other traffic from
many different jobs/tasks are active at the same time.  We don't want
MPI to starve out other traffic, so we give it a large chunk of
bandwidth but not all of it (in this example, 90% for MPI and 10% for
Lustre).

SLs are mapped to VLs by listing the VL for each SL in increasing SL
order.  In the above example, SL0 -> VL0 and SL1 -> VL1.  An entry of
15 is used for SLs you don't care about (VL15 is reserved for
management, so data traffic mapped to it is effectively dropped).

Assigning SLs
-------------

The QoS configuration is now complete, but we still need to make
protocols/applications use the appropriate SL.

Some tools allow you to pick an SL at run time, e.g.:

> mpirun -sl 0

However, it may not be easy to force users/applications to use
different SLs.  The easiest way to assign SLs is through the OpenSM
QoS policy file.

QoS Policy File
---------------

Depending on OpenSM version, this file is in
/var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.

The following is a short summary of the options I think are needed
for our environment.  See "QoS Management in OpenSM" (referenced
below) for the full set of options.

Format:

qos-ulps
    <user level protocol>, <options> : <SL level>
end-qos-ulps

<user level protocol> = IPoIB, SDP, SRP, iSER

<options> = port-num, pkey, service-id, target-port-guid 
(Note: the available options depend on which user level protocol is selected)

<SL level> = SL level 0-15.

Example:

qos-ulps
    default                                                     : 0
    any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
end-qos-ulps

Idea: 

Everything (most notably MPI) defaults to SL0.  Any traffic destined
to one of the listed target port GUIDs gets SL1.

If the GUIDs listed under target-port-guid belong to Lustre routers,
then Lustre data gets SL 1.  In combination with the VL Arbitration
and SL2VL Mapping configuration listed above, it should hopefully be
clear how MPI gets priority over Lustre but does not starve it out.

Note that a policy file using target-port-guid must be kept up to
date if GUIDs change.  You can determine GUIDs via /usr/sbin/ibstat.
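
For example (a sketch; output abbreviated, the adapter name 'mthca0'
is just a placeholder, and the GUID shown is the one from the policy
file above):

# > /usr/sbin/ibstat
CA 'mthca0'
        ...
        Port 1:
                ...
                Port GUID: 0x0002c9030002879d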

Verifying Configuration
-----------------------

The smpquery tool can be used to verify that the VL Arbitration
tables and SL2VL tables have been configured properly in
cards/switches.

# > /usr/sbin/smpquery sl2vl 346
# SL2VL table: Lid 346
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|

# > /usr/sbin/smpquery vlarb 346
# VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

The high limit can be determined by issuing portinfo queries via
/usr/sbin/smpquery.

# > /usr/sbin/smpquery portinfo 346 | grep Limit
VLHighLimit:.....................0

Random Configuration Notes
--------------------------

SLs are most often assigned at Infiniband Queue Pair (QP) creation
time.  So, if you change your QoS settings, any tools/applications
(including Lustre) that are currently running and have already created
QPs may not have picked up the newest QoS policy.  The appropriate
tools/applications should be restarted.

Not all Infiniband adapters support VLs.  Those that do may not
support all 15 VLs.  You can determine what your system supports by
issuing portinfo queries via /usr/sbin/smpquery.
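
For example (a sketch against the same Lid 346 used above; the VLCap
field reports the supported data VL range):

# > /usr/sbin/smpquery portinfo 346 | grep VLCap
VLCap:...........................VL0-7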

References
----------

QoS Management in OpenSM

(this is a link to the Git Tree - hopefully the URL is always legit)

http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD

Quality and Service in OFED 3.1 - Liran Liss

http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt

QoS support in OFED

(this is a link to the Git Tree - the URL is on the ofed_1_4 branch,
so it probably will change at some point)

http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4


