[Lustre-discuss] Meaning of LND/neterrors ?

Aurelien Degremont aurelien.degremont at cea.fr
Thu Sep 23 00:57:04 PDT 2010


Eric Barton wrote:
> It's expected that peers will crash and therefore the low-level
> network should not clutter the logs with noise and the upper
> layers should handle the problem by retrying or doing actual
> recovery.

OK, so I can read those errors as something like:
  - my IB network is not so clean
  - but Lustre upper layers will retry, so this is transparent for them
as long as I do not have too many issues of this kind.

> "RDMA failed" should really only occur when a peer node crashes.
> However it could be a sign that there are deeper problems with
> the network setup or hardware. 

OK, but in my case we see this kind of error even though no node has crashed.
For example (this occurs on LNET routers):
Tx -> ... cookie ... sending 1 waiting 0: failed  12
Closing conn to ... : error -5 (waiting)

This happens even though the corresponding node is responding and Lustre works for it.
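
As an aside for reading the "error -5" above: Linux kernel code conventionally returns negated errno values, so a negative code can be looked up in the standard errno table. The following is only a minimal sketch of that lookup, assuming a Linux host; the function name describe_kernel_error is hypothetical and is not part of Lustre or LNET. It covers only the negative code on the "Closing conn" line, not the value on the "Tx" line.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a kernel-style return code (e.g. -5) to (symbol, message).

    Hypothetical helper for illustration: it assumes the code is a
    negated errno value, which is the usual Linux kernel convention.
    """
    number = abs(code)
    # errno.errorcode maps numeric errno values to their symbolic names.
    symbol = errno.errorcode.get(number, "UNKNOWN")
    # os.strerror gives the platform's human-readable message.
    return symbol, os.strerror(number)


if __name__ == "__main__":
    symbol, message = describe_kernel_error(-5)
    print("error -5 -> {}: {}".format(symbol, message))
```

On Linux, -5 resolves to EIO, a generic I/O error, so the code by itself does not say whether the peer or the fabric was at fault.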

> If you suspect the network is
> misbehaving, I'd run an LNET self-test.  This is well documented
> in the manual (at least to people who already know how it works ;)
> and lets you soak-test the network from any convenient node.

OK :) I use it often, so that's fine.
But lnet_selftest has difficulty working properly if you are using different OFED stacks (at least v1.4.2 against v1.5.1).
So it is difficult to use as a test for my current issue.



>           Cheers,
>                    Eric
>> -----Original Message-----
>> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf
>> Of Aurelien Degremont
>> Sent: 22 September 2010 5:20 PM
>> To: lustre-devel at lists.lustre.org
>> Subject: [Lustre-devel] Meaning of LND/neterrors ?
>> Hello
>> I've noticed that Lustre network errors, especially LND errors, are considered maskable errors.
>> That means that on a production node, where the debug mask is 0, those specific errors won't be displayed
>> if they happen.
>> Does that mean that they are harmless?
>> Do upper layers resend their RPC/packet if LNDs report an error?
>> When, in my case, o2iblnd says something like "RDMA failed" (neterror), is it a big issue? Were some RPCs
>> lost or not?
>> Thanks in advance
>> --
>> Aurelien Degremont

Aurelien Degremont
