[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.

Tue May 19 08:55:21 PDT 2009

Hi,

We took a slightly different approach to deal with IB QoS in Lustre.

We decided to assign a specific service-id to Lustre: in ofa-kernel we 
added a new value in the rdma_port_space enum, that we called 
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in 
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of 
RDMA_PS_TCP (well, we did a little more than that in the Lustre code, 
because we wanted the service-id to be a ko2iblnd module parameter, so 
we added some stuff in o2iblnd_modparams.c for instance).
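
For readers who want to see how a port space value could end up in the service-id that OpenSM matches on, here is a minimal sketch. It assumes the RDMA CM builds the service-id as (port space << 16) + port, which is a recollection of the ofa-kernel cma code and not something stated in this thread. The RDMA_PS_TCP value is likewise an assumption, and HYPOTHETICAL_PS and the port number 987 are illustrative placeholders only.

```python
# Sketch only: the formula and the constants below are assumptions,
# not values taken from the patch described above.
RDMA_PS_TCP = 0x0106      # assumed value of the stock TCP port space
HYPOTHETICAL_PS = 0x0153  # hypothetical site-specific port space


def service_id(port_space: int, port: int) -> int:
    """Combine a 16-bit port space and a 16-bit port into a service-id."""
    if not 0 <= port_space <= 0xFFFF:
        raise ValueError("port space must fit in 16 bits")
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return (port_space << 16) + port


def format_service_id(value: int) -> str:
    """Render a service-id as a 64-bit hex string for a policy file."""
    return f"0x{value:016x}"


if __name__ == "__main__":
    # 987 is used purely as an example port number.
    print(format_service_id(service_id(RDMA_PS_TCP, 987)))
    print(format_service_id(service_id(HYPOTHETICAL_PS, 987)))
```

Under these assumptions, changing only the port space changes the upper bits of the service-id, which is what lets a QoS policy rule single out one kind of traffic.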

The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":
qos-ulps
    default                                                     : 0
    any, service-id=0x.....: 3
end-qos-ulps

The major drawback of this solution is that the modification we made in 
the ofa-kernel is not OpenFabrics Alliance compliant, because the 
portspace list is defined in the IB standard.

Cheers,
Sebastien.

Jim Garlick wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Hi!
>>
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed you to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>>
>> Thanks,
>>
>> Daniel.
> 
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core opterons,
> and ran MPI collective stress tests on all cores.
> 
> Here are some notes they put together - see the "QoS Policy file" section.
> 
> Jim
> ____________________________________
> QoS configuration on Infiniband
> 
> May 18, 2009
> 
> Albert Chu
> chu11 at llnl.gov
> 
> Overview
> --------
> Quality of Service (QoS) is offered in Infiniband as a means to offer some
> guarantees/minimum requirements for certain applications on the fabric.
> 
> Definitions
> -----------
> 
> Virtual Lanes (VLs): Infiniband supports up to 15 Virtual Lanes
> (numbered 0-14) for data traffic.  Each virtual lane has its own
> independent virtual transmit/receive buffers on each port of the
> fabric.
> 
> Service Level (SL): A number (0-15) that can be assigned to any
> Infiniband packet.  Infiniband does not define what a given SL means.
> It's up to the user to determine.
> 
> Basic QoS Implementation in Infiniband
> --------------------------------------
> 
> There are three basic parts to QoS in Infiniband.
> 
> 1) Assign/configure protocols/tools/applications to use appropriate
>    SLs.
> 
>    Normally, you assign different SLs to different protocols,
>    applications, etc. (e.g. MPI, Lustre).  This allows each
>    protocol/application to be given unique QoS requirements.
> 
> 2) Configure SL2VL mapping
> 
>    Map SLs to VLs.  For example, SL0->VL0, SL1->VL1, etc.
> 
> 3) Configure VL Arbitration
> 
>    Determines VL transmission rules based on a set of prioritization
>    rules.
> 
> It is the responsibility of administrators/users to use and configure
> the SLs/VLs properly.  On their own, VLs and SLs do nothing and mean
> nothing in the Infiniband card.
> 
> SL2VL Mapping Configuration
> ---------------------------
> 
> This is pretty basic.  You assign each SL to a VL.  It's a direct
> mapping, e.g. SL1->VL1, SL2->VL2.
> 
> Normally, you map SLX -> VLX.  If you do otherwise, you're starting to
> do something pretty crazy.
> 
> VL Arbitration Configuration
> ----------------------------
> 
> This is not so basic.  There are three components to VL Arbitration
> configuration, the High-Priority Table, the Low-Priority Table, and
> the Limit of High Priority.
> 
> High/Low VL Arbitration Tables
> ------------------------------
> 
> The High & Low Priority VL Arbitration Tables are each a list of
> pairs of a VL number (0-14) and a weighting value (0-255).  The
> weighting value indicates the number of 64 byte units that can be
> transmitted from that VL when it is that VL's turn to transmit.  A
> weight of 0 means no data can be transferred.  Counters are rounded up
> as needed for packets (i.e. a weight of 1 means a packet > 64 bytes can
> still be sent).  The High Priority VL Arbitration Table holds the
> weights for "high priority" data while the Low Priority VL Arbitration
> Table holds the weights for "low priority" data (the usefulness will
> make more sense after you read "Limit of High Priority" below).
> 
> Note that 64*255 =~ 16K, which is a small number for many institutions.
> I think it is easiest to think of the weights as ratios for percentage
> bandwidth if the network is completely flooded with data from all
> protocols/applications.
> 
> For example:
> 
> A) VL0 Weight = 255, VL1 Weight = 255 
> 
>    50% bandwidth for VL0 and VL1 each.
> 
> B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255 
> 
>    33% bandwidth for VL0, VL1, and VL2 each.
> 
> C) VL0 Weight = 200, VL1 Weight = 100 
> 
>    66% bandwidth for VL0, 33% bandwidth for VL1.
> 
> D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100 
> 
>    50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.
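
The ratio interpretation in examples A-D can be checked with a short script. This is a minimal sketch that assumes a fully saturated link where every listed VL always has data waiting. The function names are made up for illustration.

```python
from typing import Dict


def bandwidth_shares(weights: Dict[int, int]) -> Dict[int, float]:
    """Return each VL's share of a saturated link, in percent."""
    for vl, weight in weights.items():
        if not 0 <= vl <= 14:
            raise ValueError(f"VL {vl} is outside 0-14")
        if not 0 <= weight <= 255:
            raise ValueError(f"weight {weight} is outside 0-255")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one VL needs a non-zero weight")
    return {vl: 100.0 * weight / total for vl, weight in weights.items()}


def max_bytes_per_turn(weight: int) -> int:
    """A weight counts 64 byte units, so 255 gives 16320 bytes."""
    return weight * 64


if __name__ == "__main__":
    # Example D from the notes: VL0 = 200, VL1 = 100, VL2 = 100.
    print(bandwidth_shares({0: 200, 1: 100, 2: 100}))
    print(max_bytes_per_turn(255))
```

Only the ratios matter for the shares: weights of 200/100/100 and 100/50/50 give the same percentages, although the amount of data sent per turn differs.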
> 
> Limit of High Priority
> ----------------------
> 
> Indicates the amount of high-priority data (from the High VL
> Arbitration Table) that can be sent without an opportunity to send a
> low priority packet (from the Low VL Arbitration Table).  Increments
> are in units of 4K bytes, with two special values: 0 = one packet and
> 255 = unlimited data.
> 
> 4K*254 =~ 1M, which again is a small number for many institutions.  The
> most likely numbers to consider using are:
> 
> 0 - one packet
> 254 - max high limit data w/o being unlimited
> 255 - unlimited data
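
The arithmetic behind these three values can be written down as a small helper. This is a minimal sketch that assumes "4K" means 4096 bytes. The function name is made up for illustration.

```python
def describe_high_limit(limit: int) -> str:
    """Translate a Limit of High Priority value into plain words."""
    if not 0 <= limit <= 255:
        raise ValueError("high limit must be in the range 0-255")
    if limit == 0:
        return "one packet"   # special value
    if limit == 255:
        return "unlimited"    # special value
    # Assumption: one increment is 4096 bytes.
    return f"{limit * 4096} bytes"


if __name__ == "__main__":
    for value in (0, 254, 255):
        print(value, "->", describe_high_limit(value))
```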
> 
> VL Arbitration Examples
> -----------------------
> 
> When you combine the High/Low VL Arbitration tables with the Limit of
> High Priority, you can create some interesting QoS behavior.
> 
> Example 1:
> 
> (Following example is borrowed from the "Quality and Service in OFED
> 3.1" presentation listed below.)
> 
> High-Limit: 0
> VL-Arb-High: VL2 Weight = 1
> VL-Arb-Low: VL0 Weight = 200, VL1 Weight = 50
> 
> Effectively, anytime any data on VL2 is available, send at most one
> packet from VL2 before sending data from VL0 or VL1.  If no VL2 data
> is available, VL0 gets 80% bandwidth, VL1 gets 20% of bandwidth.
> 
> Idea: 
> 
> (Assume Lustre Meta Data Servers and Lustre OSTs are on the same
> fabric)
> 
> MPI -> SL0 -> VL0
> Lustre OST Data -> SL1 -> VL1
> Lustre Meta Data -> SL2 -> VL2
> 
> In this example, Lustre meta data traffic is assumed to be low, but
> with the high priority it is serviced faster, which should
> theoretically allow for better Lustre interaction.  When there is no
> Lustre meta data traffic on the fabric, MPI is given the majority
> share of bandwidth because it is more timing sensitive.
> 
> Example 2:
> 
> High-Limit: 254
> VL-Arb-High: VL0 Weight = 255
> VL-Arb-Low: VL1 Weight = 1
> 
> Effectively, whenever there is data on VL0, always send it before VL1.
> But do not allow VL0 to starve VL1.  Let VL1 send *something* once in
> a while.
> 
> Idea: 
> 
> MPI -> SL0 -> VL0
> Lustre -> SL1 -> VL1
> 
> So MPI always gets priority over Lustre, but cannot starve it out.
> The High-Limit of 254 means a low priority packet must be sent once in
> a while.  This could be important if Lustre "pings" are done to keep
> some services alive.
> 
> Configuring for OpenSM
> ----------------------
> 
> Currently configured in /var/cache/opensm/opensm.opts (later to be in
> /etc/opensm/opensm.conf).
> 
> #
> # QoS OPTIONS
> #
> qos TRUE
> 
> qos_policy_file /var/cache/opensm/qos-policy.conf
> 
> # QoS default options
> qos_max_vls 2
> qos_high_limit 254
> qos_vlarb_high 0:255
> qos_vlarb_low 1:1
> qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
> 
> qos_ca_max_vls 2
> qos_ca_high_limit 254
> qos_ca_vlarb_high 0:255
> qos_ca_vlarb_low 1:1
> qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
> 
> # achu: VL2 not used, need to give non-null input to buggy opensm
> qos_swe_max_vls 2
> qos_swe_high_limit 255
> qos_swe_vlarb_high 0:225,1:25
> qos_swe_vlarb_low 2:1
> qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
> 
> Notes/Comments:
> 
> There are default QoS options, and specific QoS options 
> for channel adapters, switches, etc.  They allow you to configure
> for different port-types across the fabric.
> 
> The "max_vls" entries can be ignored.
> 
> The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
> self explanatory.  The "vlarb_high"/"vlarb_low" entries take input
> in the form <VL>:<Weight>.
> 
> In the above example, channel adapters have:
> 
> VL0 Weight = 255 -> For MPI
> 
> VL1 Weight = 1 -> For Lustre
> 
> Idea: With the High Limit of 254, MPI always gets priority, but cannot
> starve Lustre.
> 
> In the above example, switches have:
> 
> VL0 Weight = 225 -> For MPI
> VL1 Weight = 25 -> For Lustre
> 
> Idea: Across the entire cluster, MPI, Lustre, etc. are going on from
> different jobs/tasks.  We don't want MPI to starve out other traffic
> so we give it a nice chunk of bandwidth but not all bandwidth (in this
> example 90% for MPI, 10% for Lustre).
> 
> SLs are mapped to VLs by listing the VL for each SL in increasing
> SL order.  In the above example, SL0 -> VL0 and SL1 -> VL1.  An entry
> of 15 is used for any SL you don't care about.
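
To make the two option formats concrete, here is a minimal sketch that parses a "vlarb" value and an "sl2vl" value as they appear in the options above. The function names are made up for illustration and are not part of OpenSM.

```python
from typing import Dict, List, Tuple


def parse_vlarb(spec: str) -> List[Tuple[int, int]]:
    """Parse '<VL>:<Weight>,<VL>:<Weight>,...' into (VL, weight) pairs."""
    entries = []
    for item in spec.split(","):
        vl_text, weight_text = item.split(":")
        entries.append((int(vl_text), int(weight_text)))
    return entries


def parse_sl2vl(spec: str) -> Dict[int, int]:
    """Parse 16 comma-separated VLs; position N is the VL for SL N."""
    vls = [int(value) for value in spec.split(",")]
    if len(vls) != 16:
        raise ValueError("expected one VL for each of the 16 SLs")
    return dict(enumerate(vls))


if __name__ == "__main__":
    print(parse_vlarb("0:225,1:25"))
    mapping = parse_sl2vl("0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15")
    print({sl: vl for sl, vl in mapping.items() if vl != 15})
```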
> 
> Assigning SLs
> -------------
> 
> The configuration of QoS is now over, but we still need to make
> protocols/applications use the appropriate SL.
> 
> Some tools allow you to pick an SL when you run.
> 
> For example:
> 
> > mpirun -sl 0
> 
> However, it may not be easy to force/change users/applications to use
> different SLs.  The easiest way to configure SLs is through the OpenSM
> QoS policy file.
> 
> QoS Policy File
> ---------------
> 
> Depending on OpenSM version, this file is in
> /var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.
> 
> The following is the short summary of options I think are needed for
> our environment.  See "QoS Management in OpenSM" for full set of
> options.
> 
> Format:
> 
> qos-ulps
>     <user level protocol>, <options> : <SL level>
> end-qos-ulps
> 
> <user level protocol> = IPoIB, SDP, SRP, iSER
> 
> <options> = port-num, pkey, service-id, target-port-guid 
> (Note: options depend on which user level protocol is selected)
> 
> <SL level> = SL level 0-15.
> 
> Example:
> 
> qos-ulps
>     default                                                     : 0
>     any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
> end-qos-ulps
> 
> Idea: 
> 
> Everything (most notably MPI) defaults to SL0.  Any traffic destined
> for one of the listed target port GUIDs gets SL1.
> 
> If the GUIDs in the target-port-guid list are Lustre routers, that
> would indicate Lustre data gets SL=1.  In combination with the VL
> Arbitration and SL2VL Mapping configuration listed above, hopefully it
> can be seen how MPI gets priority over Lustre, but does not starve it
> out.
> 
> Note that files with target-port-guids must be kept up to date if
> GUIDs change.  You can determine GUIDs via /usr/sbin/ibstat.
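
Since the GUID list has to be kept current, one option is to regenerate the qos-ulps block from a list of GUIDs. The following is a minimal sketch that mirrors the layout of the example above. The function name is made up, and the column alignment of the original example is not reproduced.

```python
from typing import List


def render_qos_ulps(default_sl: int, guid_sl: int, guids: List[int]) -> str:
    """Build a qos-ulps block: a default SL plus one target-port-guid rule."""
    for sl in (default_sl, guid_sl):
        if not 0 <= sl <= 15:
            raise ValueError("SL must be in the range 0-15")
    if not guids:
        raise ValueError("at least one GUID is required")
    guid_list = ",".join(f"0x{guid:016x}" for guid in guids)
    lines = [
        "qos-ulps",
        f"    default : {default_sl}",
        f"    any, target-port-guid {guid_list} : {guid_sl}",
        "end-qos-ulps",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # GUIDs copied from the example in the notes.
    print(render_qos_ulps(0, 1, [0x0002C9030002879D, 0x0002C90300028765]))
```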
> 
> Verifying Configuration
> -----------------------
> 
> The tool smpquery can be used to verify that VL Arbitration tables and
> SL2VL tables have been configured in cards/switches properly.
> 
> # > /usr/sbin/smpquery sl2vl 346
> # SL2VL table: Lid 346
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|
> 
> # > /usr/sbin/smpquery vlarb 346
> # VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
> # Low priority VL Arbitration Table:
> VL    : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> # High priority VL Arbitration Table:
> VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
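
For checking many ports, the VL/WEIGHT rows shown above can be turned back into (VL, weight) pairs and compared against the intended configuration. This is a minimal sketch that assumes the row layout is exactly as printed here. The function names are made up, and other smpquery versions may format the table differently.

```python
from typing import List, Tuple


def parse_table_row(line: str) -> List[int]:
    """Turn 'VL    : |0x1 |0x0 |' into [1, 0]."""
    _, _, cells = line.partition(":")
    return [int(cell.strip(), 16) for cell in cells.split("|") if cell.strip()]


def vlarb_entries(vl_line: str, weight_line: str) -> List[Tuple[int, int]]:
    """Pair up VLs and weights, dropping unused slots (weight 0)."""
    vls = parse_table_row(vl_line)
    weights = parse_table_row(weight_line)
    return [(vl, weight) for vl, weight in zip(vls, weights) if weight != 0]


if __name__ == "__main__":
    high_vl = "VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |"
    high_weight = "WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |"
    print(vlarb_entries(high_vl, high_weight))
```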
> 
> The high limit can be determined by issuing portinfo queries via
> /usr/sbin/smpquery.
> 
> # > /usr/sbin/smpquery portinfo 346 | grep Limit
> VLHighLimit:.....................0
> 
> Random Configuration Notes
> --------------------------
> 
> SLs are most often assigned during Infiniband Queue Pair (QP) creation
> time.  So, if you change your QoS settings, any tools/applications
> (including Lustre) that are currently running and have already created
> QPs may not have absorbed the newest QoS policy.  The appropriate
> tools/applications should be restarted.
> 
> Not all Infiniband adapters support VLs.  Those that do may not
> support all 15 VLs.  You can determine what your system supports by
> issuing portinfo queries via /usr/sbin/smpquery.
> 
> References
> ----------
> 
> QoS Management in OpenSM
> 
> (this is a link to the Git Tree - hopefully the URL is always legit)
> 
> http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD
> 
> Quality and Service in OFED 3.1 - Liran Liss
> 
> http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt
> 
> QoS support in OFED
> 
> (this is a link to the Git Tree - the URL is on the ofed_1_4 branch,
> so it probably will change at some point)
> 
> http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4