[lustre-discuss] mounting over ipoib via opa (was: mount issue and ecmp?)

Michael Di Domenico mdidomenico4 at gmail.com
Thu Feb 14 10:01:46 PST 2019


just to keep people in the loop in case anyone else comes across a similar issue

i've narrowed this a bit further.  using iperf, i can

iperf full rate from compute node to compute node
iperf full rate from compute node to login node
iperf full rate from login node to lnet router

what i cannot do is iperf from compute node to lnet router

iperf reports zeros when i do that.  however, if i adjust the window
size on the iperf command line down to 1705 or less, i can in fact
pass traffic successfully, but VERY slowly

conversely, if i add rx_buffer_size=1400 and tx_buffer_size=1400 to the
ksocklnd module parameters, i can successfully mount the lustre
partition and see stuff (but it's incredibly slow)
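
the module parameter change, for anyone who wants to try it, is along
these lines (the file name is arbitrary):

  # /etc/modprobe.d/ksocklnd.conf
  # shrink the socklnd socket buffers to 1400 bytes
  options ksocklnd rx_buffer_size=1400 tx_buffer_size=1400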

this clearly points to some weird transit error when going from opa to
ethernet on linux.  i've reached out to intel for help, but if anyone
has any ideas i'm all ears.

but this is likely not a lustre error, just an instigator



On Tue, Feb 12, 2019 at 8:22 AM Michael Di Domenico
<mdidomenico4 at gmail.com> wrote:
>
> thanks for the suggestion, but i'm not sure it's applicable in my
> scenario.  the storage and lnet routers do not have OPA cards
> installed; it's only the clients.  the storage does have a mix of
> mellanox, ethernet, and qdr hardware, but that's all working fine.  i
> have multiple clusters connected to the storage, on all three
> interconnects.
>
> i tried setting the arp filters in the document, but it hasn't made
> any difference.  i do have the opa tools from intel installed, but i
> had tried this with the rhel bundled opa drivers as well and got the
> same result.  i've tried building the lustre client with --o2ib=no,
> same result.  i tried connected vs datagram mode on the ipoib
> interface, same result.
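>
> for anyone following along, the mode toggle i mean is the usual sysfs
> one (ib0 is a placeholder for the ipoib interface):
>
>   cat /sys/class/net/ib0/mode               # datagram or connected
>   echo connected > /sys/class/net/ib0/mode
>   echo datagram > /sys/class/net/ib0/mode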
>
> if i wasn't able to lctl ping the storage devices from the client, i
> would presume there's a network problem.  if i switch from ipoib to
> the ethernet mgmt interfaces on the clients, i can mount lustre, which
> should confirm and narrow it down to the ipoib interface specifically
> and not anything with the network/routing.  and since i have a bevy of
> other protocols running over ipoib (nfs/ssh/others) i'm pretty sure
> that localizes the issue to something with lustre
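>
> the check i mean is just, with the nid redacted as in the log below:
>
>   lctl ping xxx.xx.xx.xx@o2ib100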
>
> if there's more debugging i can try i'm all ears
>
> the one message i get from client in syslog is
>
> Lustre: 253340:0:(client.c:2114:ptlrpc_expire_one_request()) @@@
> Request sent has timed out for slow reply: [sent 1549976603/real
> 1549976603] req at ffff9c1d4bf40300 x1625268266467424/t0(0)
> o503->MGCxxx.xx.xx.xx at o2ib100@xxx.xx.xx.xx at o2ib100:26/25 lens 272/8416
> e 0 to 1 dl 154997615 ref 2 fl Rpc:X/0/ffffffffff rc 0/-1
>
> nothing gets reported on the MGS/MDS other than a client connection
> restored message.
>
>
> On Mon, Feb 11, 2019 at 3:15 PM Amir Shehata
> <amir.shehata.whamcloud at gmail.com> wrote:
> >
> > If your routers have multiple OPA/MLX interfaces, we found that linux routing can return the wrong HW address, which causes an address resolution error.
> >
> > You can try the following linux routing config to see if it helps:
> > https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup
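> >
> > The settings that page walks through are roughly of this shape
> > (interface name and values here are illustrative; see the wiki for
> > the authoritative list):
> >
> >   sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> >   sysctl -w net.ipv4.conf.ib0.arp_announce=2
> >   sysctl -w net.ipv4.conf.ib0.rp_filter=0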
> >
> > On Mon, 11 Feb 2019 at 12:04, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> >>
> >> i've narrowed down that my issue seems to stem from running over ipoib
> >> on an opa network
> >>
> >> i managed to pull all the routing and other things around so the only
> >> difference was whether i rode over ipoib or not
> >>
> >> when i mount via ethernet, it works fine
> >>
> >> when i try the same mount via ipoib running on top of opa it gets
> >> "input/output error".  i can however lctl ping the storage and i see
> >> connections from the client to the MGS.  so some of the connectivity
> >> is working, but it's breaking down somewhere else
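> >>
> >> the mount in question is just the usual form, with the nid and
> >> fsname as placeholders:
> >>
> >>   mount -t lustre xxx.xx.xx.xx@o2ib100:/fsname /mnt/lustre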
> >>
> >> is anyone else running over ipoib on an opa network?  if so, do you
> >> have lnet routing?
> >>
> >> some particulars
> >>
> >> rhel 7.6 clients
> >> 2.10.5 clients
> >> 2.5.x lustre servers (cray)
> >> lnet routing between storage and other networks
> >> currently running tcp ethernet, qdr infinipath, and fdr10 mellanox to
> >> the storage through routers
> >> no other machines are having mount issues
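> >>
> >> for completeness, the client-side lnet config is along these lines
> >> (nids and interface names are placeholders):
> >>
> >>   # /etc/modprobe.d/lustre.conf
> >>   # tcp over the ipoib interface, with a route to the o2ib100 net
> >>   options lnet networks="tcp0(ib0)" routes="o2ib100 xxx.xx.xx.xx@tcp0"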
> >>
> >>
> >> On Fri, Feb 8, 2019 at 9:33 AM Michael Di Domenico
> >> <mdidomenico4 at gmail.com> wrote:
> >> > poking at this further, it doesn't look like it's an ECMP issue.
> >> >
> >> > Are there any known reports of issues when running Lustre over ipoib
> >> > over an opa fabric?  seems a stretch, but it's the only difference in
> >> > the network at this point.
> >> >
> >> > can anyone suggest somewhere to look for more debug info?
> >> > /var/log/messages and dmesg don't reveal much info
> >> >
> >> > On Mon, Feb 4, 2019 at 9:19 AM Michael Di Domenico
> >> > <mdidomenico4 at gmail.com> wrote:
> >> > >
> >> > > Has anyone heard of lustre having trouble mounting when ECMP is used
> >> > > on the compute nodes' default gateway?
> >> > >
> >> > > I'm trying to mount an existing lustre filesystem on a new cluster,
> >> > > where the connections ride over OPA IPoIB, which is then converted to
> >> > > 10ge via four routers.  I'm using ECMP to distribute the packets over
> >> > > the four routers.
> >> > >
> >> > > I can mount lustre on other ethernet clients, but not the ones behind
> >> > > my ECMP gateways.  Changing the compute node gateway from ECMP to a
> >> > > single device doesn't change anything.  I'm not easily able to revert
> >> > > the network side from ECMP to a single route, so i haven't tried that.
> >> > >
> >> > > The output i get from mount is "failed: Input/output error retries left: 0"
> >> > >
> >> > > syslog on the client and the MGS seems to show that the connection is
> >> > > being broken between the MGS and client during the mount with a "timed
> >> > > out for slow reply" message.

