[Lustre-discuss] multi-homed lustre with both IB and TCP

Lee, Brett brett.lee at intel.com
Tue Mar 25 15:54:57 PDT 2014


John,

Sounds like a complex network, so simplifying the problem might help.  One way to simplify would be to set up the client LNet config exactly as you think it should be, then try to "lctl ping" the MGS of each file system from *that* client.  If each ping works, you're close; if not, sniff the network to see whether the client's pings reach the MGSs, and if they do, check the route(s) back as well.

Occasionally I've found that the source knows how to route to the destination, but the destination has no route *back*.  lctl ping lets you test this, at least for the MGSs.  After that come the other servers...
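The ping sweep described above can be scripted so that every MGS is tested in one pass. This is a minimal Python sketch with these assumptions: `lctl` is on the PATH, and it exits with a non-zero status when a ping fails. The NIDs are the MGS addresses from John's fstab further down this thread. The helper names `lctl_ping` and `ping_all` are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable


def lctl_ping(nid: str, timeout_s: float = 30.0) -> bool:
    """Return True if `lctl ping <nid>` exits with status 0.

    Assumption: lctl reports a failed ping through a non-zero exit status.
    """
    try:
        result = subprocess.run(
            ["lctl", "ping", nid],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # lctl is missing or not executable, or the ping outlived our timeout.
        return False
    return result.returncode == 0


def ping_all(nids: Iterable[str],
             ping: Callable[[str], bool] = lctl_ping) -> Dict[str, bool]:
    """Ping each NID once; `ping` is injectable so the logic can be tested offline."""
    return {nid: ping(nid) for nid in nids}


if __name__ == "__main__":
    # Hypothetical list: one MGS per file system, as written in the client's fstab.
    mgs_nids = [
        "172.17.1.5@o2ib0",   # direct InfiniBand
        "172.16.24.5@o2ib1",  # reached through the LNet routers
    ]
    for nid, ok in ping_all(mgs_nids).items():
        print(f"{nid}: {'ok' if ok else 'FAILED'}")
```

Adding the router NIDs (e.g. ROUTER1_IP@tcp0) to the list helps narrow things down. It shows whether the first hop is broken or whether the failure lies beyond the routers, which is where a missing route back would show up.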


Dr. Brett Lee, Solutions Architect
High Performance Data Division, Intel
+1.303.625.3595






From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of John Lalande
Sent: Tuesday, March 25, 2014 2:17 PM
To: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] multi-homed lustre with both IB and TCP

Hi, Ron-

Thanks for sharing your config with me. I tried tweaking ours, and it's still a no-go. I think the main difference here is that it's our client (not the servers) that is multi-homed.

The client needs to access:

  1.  one (eventually more) Lustre file system(s) via directly attached InfiniBand
  2.  one Lustre file system via TCP (no TCP->IB routing)
  3.  several Lustre file systems via routed InfiniBand (TCP->IB)

I can't get #1 and #3 working together. I can get one or the other working, depending on how I've configured the LNet networks in modprobe.d/lustre.conf, but not both (#2 works either way).

Does anyone else have ideas on this?

Thanks!

John


On 3/24/14, 4:13 PM, Jerome, Ron wrote:

Hi John,

Don't know if you got this working, but I can tell you that I have more or less the same setup working.  Basically, I have a client on a public TCP network connecting to an LNet router (via TCP), which then forwards via IB to the Lustre cluster.  (All the Lustre servers are multi-homed and have a tcp0 network internally, hence "tcp1" for the external TCP network.)

The external client config is (where 132.246.x.x is the TCP address of the router):

---------------------------------
options lnet networks=tcp1(eth0) routes="o2ib0 132.246.x.x@tcp1"

...

Ron.

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of John Lalande
Sent: March 21, 2014 3:56 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] multi-homed lustre with both IB and TCP

Hi-

I am trying to set up a Robinhood policy engine server that will watch several different Lustre file systems: one with a direct InfiniBand connection, one via TCP without an intermediate Lustre router, and several others via TCP through Lustre routers.

I can mount the file systems reached via IB and via direct TCP, but not the routed ones. (I am able to mount the routed ones if I take out the config for the o2ib0(ib0) network.)

My modprobe.conf looks like this:

options lnet networks="o2ib0(ib0),tcp0(em1.497)" routes="o2ib1 ROUTER1_IP@tcp0; o2ib1 ROUTER2_IP@tcp0; o2ib1 ROUTER3_IP@tcp0"

where ROUTER1_IP, ROUTER2_IP, etc. are actual IP addresses on our University's subnet that I don't want to publish here.
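A client-side sanity check of an `options lnet` line like the one above can be scripted. This Python sketch uses hypothetical helper names, and it understands only the simple `net(iface)` and `net gateway@net; ...` forms that appear in this thread. It parses the `networks=` and `routes=` values and flags any gateway that is not on one of the client's own networks, because the client could not reach such a router directly.

```python
import re
import shlex
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LNetConfig:
    local: Dict[str, str] = field(default_factory=dict)         # net -> interface
    routes: Dict[str, List[str]] = field(default_factory=dict)  # net -> gateway NIDs


def normalize_net(net: str) -> str:
    """Treat a bare network type as instance 0 ('tcp' -> 'tcp0', 'o2ib' -> 'o2ib0')."""
    return net if net[-1].isdigit() else net + "0"


def parse_lnet_options(line: str) -> LNetConfig:
    """Parse the networks= and routes= values of an 'options lnet ...' line."""
    params: Dict[str, str] = {}
    for token in shlex.split(line):  # shlex removes the double quotes for us
        if "=" in token:
            key, value = token.split("=", 1)
            params[key] = value

    cfg = LNetConfig()
    # "o2ib0(ib0),tcp0(em1.497)" -> {"o2ib0": "ib0", "tcp0": "em1.497"}
    for net, iface in re.findall(r"(\w+)(?:\(([^)]*)\))?", params.get("networks", "")):
        cfg.local[normalize_net(net)] = iface

    # "o2ib1 GW1@tcp0; o2ib1 GW2@tcp0" -> {"o2ib1": ["GW1@tcp0", "GW2@tcp0"]}
    for entry in params.get("routes", "").split(";"):
        fields = entry.split()
        if not fields:
            continue
        gateways = [f for f in fields[1:] if "@" in f]  # other tokens are ignored here
        cfg.routes.setdefault(normalize_net(fields[0]), []).extend(gateways)
    return cfg


def check_gateways(cfg: LNetConfig) -> List[str]:
    """List gateways that are not on one of the client's own networks."""
    problems = []
    for dest, gateways in cfg.routes.items():
        for gw in gateways:
            gw_net = normalize_net(gw.rsplit("@", 1)[1])
            if gw_net not in cfg.local:
                problems.append(f"route to {dest}: gateway {gw} is on {gw_net}, not a local net")
    return problems


if __name__ == "__main__":
    john = ('options lnet networks="o2ib0(ib0),tcp0(em1.497)" '
            'routes="o2ib1 ROUTER1_IP@tcp0; o2ib1 ROUTER2_IP@tcp0; o2ib1 ROUTER3_IP@tcp0"')
    cfg = parse_lnet_options(john)
    print(cfg.local)    # {'o2ib0': 'ib0', 'tcp0': 'em1.497'}
    print(cfg.routes)   # {'o2ib1': ['ROUTER1_IP@tcp0', 'ROUTER2_IP@tcp0', 'ROUTER3_IP@tcp0']}
    print(check_gateways(cfg) or "all gateways are on local networks")
```

The line shown passes this check, which suggests the problem is not a malformed route on the client. However, a client-only check cannot say anything about the routers' or servers' configuration, such as whether they have a route back to the client.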



/etc/fstab looks like this:

172.17.1.5@o2ib0:/ib_filesystem     /ib_filesystem  lustre  defaults,_netdev,user_xattr  0 0
172.16.24.5@o2ib1:/routedfs1        /fs1            lustre  defaults,_netdev,user_xattr  0 0
172.16.23.14@o2ib1:/routedfs2       /fs2            lustre  defaults,_netdev,user_xattr  0 0
172.16.25.189@o2ib1:/routedfs3      /fs3            lustre  defaults,_netdev,user_xattr  0 0
172.16.25.241@o2ib1:/routedfs4      /fs4            lustre  defaults,_netdev,user_xattr  0 0
128.104.X.X@tcp:/tcpfs1             /tcpfs1         lustre  defaults,_netdev             0 0
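With the modprobe line and the fstab side by side, each mount target can be checked against the client's LNet view. The Python sketch below uses a deliberately simplified model, not LNet's actual selection code. In this model, a NID whose network is one of the client's own is reached directly; otherwise, the client needs a `routes=` entry for that network. The tables are transcribed by hand from John's config, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Client view transcribed by hand from John's modprobe line; the router
# addresses remain the placeholders he used.
LOCAL_NETS: Dict[str, str] = {"o2ib0": "ib0", "tcp0": "em1.497"}
ROUTES: Dict[str, List[str]] = {
    "o2ib1": ["ROUTER1_IP@tcp0", "ROUTER2_IP@tcp0", "ROUTER3_IP@tcp0"],
}


def normalize_net(net: str) -> str:
    """Treat a bare network type as instance 0 ('tcp' -> 'tcp0')."""
    return net if net[-1].isdigit() else net + "0"


def mgs_nids(fstab_line: str) -> List[str]:
    """Extract MGS NIDs from the device field 'nid[:nid...]:/fsname' of an fstab line."""
    device = fstab_line.split()[0]
    nid_part, _, _fsname = device.partition(":/")
    return [nid for nid in re.split(r"[:,]", nid_part) if nid]


def classify(nid: str) -> Tuple[str, str]:
    """Simplified model: local network -> direct; else a route is needed; else unreachable."""
    net = normalize_net(nid.rsplit("@", 1)[1])
    if net in LOCAL_NETS:
        return "direct", f"{net} on {LOCAL_NETS[net]}"
    if net in ROUTES:
        return "routed", f"{net} via " + ", ".join(ROUTES[net])
    return "unreachable", f"no local interface or route for {net}"


FSTAB = """\
172.17.1.5@o2ib0:/ib_filesystem  /ib_filesystem  lustre  defaults,_netdev,user_xattr  0 0
172.16.24.5@o2ib1:/routedfs1     /fs1            lustre  defaults,_netdev,user_xattr  0 0
128.104.X.X@tcp:/tcpfs1          /tcpfs1         lustre  defaults,_netdev             0 0
"""

if __name__ == "__main__":
    for line in FSTAB.splitlines():
        for nid in mgs_nids(line):
            kind, how = classify(nid)
            print(f"{nid:20} {kind:8} {how}")
```

Under this model, every target in the fstab resolves to either a local network or a configured route. Note that `@tcp` is treated as `tcp0`. The model cannot see whether the routers and remote servers can route replies back to the client, which is the question Brett's `lctl ping` test is meant to answer.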



In dmesg, I see:

Lustre: 6923:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1395431267/real 1395431267] req@ffff880c2aa04800 x1463215106031860/t0(0) o250->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 400/544 e 0 to 1 dl 1395431272 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 7239:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880c2aa04000 x1463215106031864/t0(0) o101->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 7230:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88182b1fac00 x1463215106031872/t0(0) o101->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 7230:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88182a1ab000 x1463215106031876/t0(0) o101->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Lustre: 6923:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1395431292/real 1395431292] req@ffff88182a2a3400 x1463215106031976/t0(0) o250->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 400/544 e 0 to 1 dl 1395431302 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 7239:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880c2aa04000 x1463215106031868/t0(0) o101->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
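When a log like this gets long, a quick tally shows whether every failure points at the same peer. Below is a minimal Python sketch that assumes the one-line-per-message format shown above. The opcode names in `OPCODES` are an assumption drawn from Lustre's wire-protocol definitions, not something stated in this thread, and should be checked against the installed version. The function and table names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed opcode names (verify against lustre_idl.h for your Lustre version).
OPCODES = {101: "LDLM_ENQUEUE", 250: "MGS_CONNECT"}

# Matches e.g. "o250->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25":
# opcode, import name, then the peer NID the request was sent to.
REQ_RE = re.compile(
    r"\bo(?P<opc>\d+)->(?P<imp>[^@\s]+@[^@\s]+)@(?P<peer>[^@\s:]+@[^@\s:]+):\d+/\d+"
)


def summarize(lines: Iterable[str]) -> Counter:
    """Count failed requests per (peer NID, opcode name, failure kind)."""
    counts: Counter = Counter()
    for line in lines:
        m = REQ_RE.search(line)
        if not m:
            continue  # not a request-failure line
        if "timed out" in line:
            kind = "timeout"
        elif "send limit expired" in line:
            kind = "send limit expired"
        else:
            kind = "other"
        opc = int(m.group("opc"))
        counts[(m.group("peer"), OPCODES.get(opc, f"opc {opc}"), kind)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Lustre: 6923:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent "
        "has timed out for slow reply: [sent 1395431267/real 1395431267] "
        "req@ffff880c2aa04800 x1463215106031860/t0(0) "
        "o250->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 400/544 e 0 to 1",
        "LustreError: 7239:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit "
        "expired   req@ffff880c2aa04000 x1463215106031864/t0(0) "
        "o101->MGC172.16.24.5@o2ib1@172.16.24.5@o2ib1:26/25 lens 328/344 e 0 to 0",
    ]
    for (peer, op, kind), n in summarize(sample).items():
        print(f"{n} x {op} to {peer}: {kind}")
```

In the excerpt above, every failed request (o250 and o101 alike) targets 172.16.24.5@o2ib1, the MGS in the routedfs1 fstab line. None of them gets a reply over the routed path.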



So... is what we're trying to do here possible, and I'm just mangling the config? Or is combining Lustre over direct IB with Lustre via an IB router not possible?

Thanks for any help!

John






--
John Lalande
Space Science & Engineering Center
University of Wisconsin - Madison
john.lalande at ssec.wisc.edu / 608-263-2268