[Lustre-discuss] Meaning of LND/neterrors ?

Eric Barton eeb at whamcloud.com
Fri Sep 24 05:45:27 PDT 2010


Aurelien,

> Eric Barton wrote:
> > It's expected that peers will crash and therefore the low-level
> > network should not clutter the logs with noise and the upper
> > layers should handle the problem by retrying or doing actual
> > recovery.
> 
> Ok, so I can interpret those errors as something like:
>   - my IB network is not so clean
>   - but Lustre upper layers will retry, and so this is transparent for them
> as long as I do not have too many issues of this kind.
> 
> 
> > "RDMA failed" should really only occur when a peer node crashes.
> > However it could be a sign that there are deeper problems with
> > the network setup or hardware.
> 
> Ok, but in my case we have issues where nodes do not crash but we still get this kind of error, like:
> (this occurs on LNET routers)
> Tx -> ... cookie ... sending 1 waiting 0: failed  12
> Closing conn to ... : error -5 (waiting)
> 
> Even if the corresponding node is responding and Lustre works for it.

Then I'd suspect the IB network (switches and cabling).  If I were you,
I'd really want to root these problems out.  While they persist, Lustre
can evict clients spuriously and clients may appear to hang for many
seconds at a time.
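
To see whether the failures cluster on particular peers (which would point
at specific cables or switch ports), a rough sketch along these lines could
tally the connection-close messages per peer. The message shape it matches
is an assumption based only on the "Closing conn to ... : error -5" line
quoted above, and `tally_closes` is just an illustrative name, not anything
from Lustre itself.

```python
import re
import sys
from collections import Counter

# Assumed message shape, taken only from the quoted log line:
#   "Closing conn to <peer> : error -5 (waiting)"
# The real peer field is elided in the quote, so <peer> is matched
# as any run of non-whitespace characters.
CLOSE_RE = re.compile(r"Closing conn to\s+(\S+)\s*:\s*error\s+(-?\d+)")


def tally_closes(lines):
    """Count connection-close messages per (peer, error code)."""
    counts = Counter()
    for line in lines:
        m = CLOSE_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Read log text from stdin and print the most affected peers first.
    for (peer, err), n in tally_closes(sys.stdin).most_common():
        print(f"{n:6d}  {peer}  error {err}")
```

Feeding it the router's console log on stdin would list the peers with the
most closed connections first. If a handful of peers dominate, I'd start
with their links.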

> > If you suspect the network is
> > misbehaving, I'd run an LNET self-test.  This is well documented
> > in the manual (at least to people who already know how it works ;)
> > and lets you soak-test the network from any convenient node.
> 
> Ok :) I use it often, so that's ok.
> But lnet_selftest has difficulty working properly if you're using different OFED stacks (at least
> v1.4.2 against v1.5.1).
> So it is difficult to use it as a test for my current issue.

Hmm - LNET self-test doesn't care at all what the underlying networks
are, so if networking breaks when you're using different OFED stacks,
I'd suspect the real problem is that OFED version interoperation doesn't
work when the network is under stress.  I'm not clear what guarantees on
version interoperation (if any) OFED makes, and even if it's supposed to
work, it could easily be buggy.

          Cheers,
                   Eric





