[Lustre-discuss] Multiple IB ports

Rick Mohr rmohr at utk.edu
Fri Apr 1 14:34:09 PDT 2011


On Sun, 2011-03-20 at 22:53 -0500, Brian O'Connor wrote:

>     Anybody actually using multiple IB ports on a client for an
> aggregated connection?

I am trying to do something like what you mentioned.  I am working on a
machine with multiple IB ports, but rather than trying to aggregate
links, I am just trying to direct Lustre traffic over different IB ports
so there will essentially be a single QDR IB link dedicated to each
MDS/OSS server.  Below are some of the main details.  (I can provide
more detailed info if you think it would be useful.)

The storage is a DDN SFA10K couplet with 28 LUNs.  Each controller in
the couplet has 4 QDR IB ports, but only 2 ports on each controller are
connected to the IB fabric.  There is a single MGS/MDS server and 4 OSS
servers.  All servers have a single QDR IB port connected to the fabric.
Each OSS node does an SRP login to a different DDN port and serves 7 of
the 28 OSTs.  The Lustre client is an SGI UV1000 (1024 cores, 4 TB RAM)
with 24 QDR IB ports, of which we are currently using only 5.
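
For reference, the SRP logins on the OSS nodes use the standard
in-kernel SRP initiator.  A minimal sketch is below; the HCA name
(mlx4_0) and the target id string are placeholders rather than our
actual values:

    # Discover the SRP targets visible through the HCA
    # (ibsrpdm comes from the srptools package).
    ibsrpdm -c -d /dev/infiniband/umad0

    # Log in to one target by copying the matching line from the
    # ibsrpdm output into the initiator's add_target file.
    echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=..." \
        > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target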

The 5 MDS/OSS servers each have their single IB port configured on 2
different lnets.  All 5 servers have o2ib0 configured as well as a
server-specific lnet (oss1 -> o2ib1, oss2 -> o2ib2, ..., mds -> o2ib5).
The client has lnets o2ib[1-5] configured (one on each of its 5 IB
ports).  I also had to configure some static IP routes on the client so
that each Lustre server could ping the corresponding IB port on the
client.
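
To make that concrete, here is roughly what the configuration looks
like.  The interface names, IP addresses, and filesystem name are
made-up placeholders rather than our exact values:

    # On oss1 (one physical port, ib0, carrying both o2ib0 and o2ib1),
    # e.g. in /etc/modprobe.d/lustre.conf:
    options lnet networks="o2ib0(ib0),o2ib1(ib0)"

    # On the UV1000 client, one lnet per IB port:
    options lnet networks="o2ib1(ib0),o2ib2(ib1),o2ib3(ib2),o2ib4(ib3),o2ib5(ib4)"

    # Static route on the client so that traffic to oss1 always uses
    # the matching client port (repeat for each server):
    ip route add 10.10.1.1/32 dev ib0 src 10.10.1.100

    # The client mounts via the MDS's server-specific lnet:
    mount -t lustre 10.10.5.1@o2ib5:/scratch /mnt/scratch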

I am still doing performance testing and experimenting with
configuration parameters.  In general, I am getting better performance
than with a single QDR IB link, but it is certainly not scaling up
linearly.  I can't say for sure where the bottleneck is.  It could be a
misconfiguration on my part, some limitation I am hitting within
Lustre, or just the natural result of running Lustre on a giant
single-system-image SMP machine.  (I am fairly sure, though, that at
least part of the problem is poor NUMA remote memory access
performance.)
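
A quick way to check how much NUMA locality matters is something like
the following sketch (the HCA name, node number, and paths are
placeholders, not our exact values):

    # Show the NUMA topology of the machine.
    numactl --hardware

    # Find which NUMA node a given HCA sits on.
    cat /sys/class/infiniband/mlx4_0/device/numa_node

    # Rerun an I/O test with CPUs and memory pinned to that node and
    # compare against an unpinned run.
    numactl --cpunodebind=2 --membind=2 \
        dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=4096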

-- 
Rick Mohr
HPC Systems Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu/



