[lustre-discuss] mounting over ipoib via opa (was: mount issue and ecmp?)

Michael Di Domenico mdidomenico4 at gmail.com
Mon Feb 11 12:04:09 PST 2019


i've narrowed my issue down; it seems to stem from running over ipoib
on an opa network

i managed to rearrange the routing and other pieces so the only
difference was whether i rode over ipoib or not

when i mount via ethernet, it works fine

when i try the same mount via ipoib running on top of opa it fails with
"input/output error".  i can however lctl ping the storage and i see
connections from the client to the MGS.  so some of the connectivity
is working, but it's breaking down somewhere else

is anyone else running over ipoib on an opa network?  if so, do you
have lnet routing?

some particulars

rhel 7.6 clients
lustre 2.10.5 clients
2.5.x lustre servers (cray)
lnet routing between storage and other networks
currently running tcp ethernet, qdr infinipath, and fdr10 mellanox to
the storage through routers
no other machines are having mount issues
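
for anyone chasing a similar failure, here is a rough sketch of the
LNet-level checks that could be collected around a failing mount. it only
builds and prints the command lines; nothing is executed. the function
names, the NID 10.0.0.1@tcp, and the dump path are placeholders for
illustration and are not taken from this setup. the commands themselves
(lctl list_nids, lctl ping, lnetctl net show, lnetctl route show,
lctl set_param debug=+net, lctl clear, lctl dk) are the usual client-side
lustre utilities, and whether they reveal the cause here is untested.

```python
import shlex


def debug_plan(server_nid, dump_path="/tmp/lustre-mount-debug.log"):
    """Return (before, after): commands to run before and after a manual mount retry.

    server_nid and dump_path are placeholders supplied by the caller.
    """
    before = [
        ["lctl", "list_nids"],                # NIDs configured on this client
        ["lctl", "ping", server_nid],         # LNet-level reachability of the server
        ["lnetctl", "net", "show"],           # local LNet networks and their interfaces
        ["lnetctl", "route", "show"],         # LNet routes configured on this node
        ["lctl", "set_param", "debug=+net"],  # add network tracing to the debug mask
        ["lctl", "clear"],                    # empty the kernel debug buffer
    ]
    after = [
        ["lctl", "dk", dump_path],            # dump the kernel debug buffer to a file
    ]
    return before, after


def render(commands):
    """Turn argv lists into shell-safe command strings for printing."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical NID; substitute the NID of the MGS or router under test.
    before, after = debug_plan("10.0.0.1@tcp")
    for line in render(before):
        print(line)
    print("# retry the failing mount by hand here")
    for line in render(after):
        print(line)
```

the split into "before" and "after" is only there so the debug buffer is
cleared right before the mount attempt and dumped right after it, which
keeps the dump focused on the failing mount.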


On Fri, Feb 8, 2019 at 9:33 AM Michael Di Domenico
<mdidomenico4 at gmail.com> wrote:
> poking at this further, it doesn't look like it's an ECMP issue.
>
> Are there any known reports of issues when running Lustre over ipoib
> over an opa fabric?  seems a stretch, but it's the only difference in
> the network at this point.
>
> can anyone suggest somewhere to look for more debug info?
> /var/log/messages and dmesg don't reveal much info
>
> On Mon, Feb 4, 2019 at 9:19 AM Michael Di Domenico
> <mdidomenico4 at gmail.com> wrote:
> >
> > Has anyone heard of lustre having trouble mounting when ECMP is used
> > on the compute nodes' default gateway?
> >
> > I'm trying to mount an existing lustre filesystem on a new cluster,
> > where the connections ride over OPA IPoIB, which is then converted to
> > 10ge via four routers.  I'm using ECMP to distribute the packets over
> > the four routers.
> >
> > I can mount lustre on other ethernet clients, but not the ones behind
> > my ECMP gateways.  Changing the compute node gateway from ECMP to a
> > single device doesn't change anything.  I'm not easily able to revert
> > the network side from ECMP to a single route, so i haven't tried that.
> >
> > The output i get from mount is, "failed: Input/output error retries left: 0"
> >
> > syslog on the client and the MGS seems to show that the connection is
> > being broken between the MGS and client during the mount with a "timed
> > out for slow reply" message.

