[Lustre-discuss] Client evictions and RDMA failures

syed haider syed.haider at gmail.com
Tue Mar 31 08:38:12 PDT 2009


Hi Brian,

Thanks for the response. I've run a few IB tests, and here is an
interesting result on the port for a failed node:

[root@tiger-node-0-1 ~]# ibqueryerrors.pl -c -a -r
Suppressing: RcvSwRelayErrors
Errors for 0x0008f104003f0e21 "ISR9288/ISR9096 Voltaire sLB-24"
   GUID 0x0008f104003f0e21 port 23: [XmtDiscards == 4]
         Actions:
          XmtDiscards: This is a symptom of congestion and may require
tweaking either HOQ or switch lifetime values

         Link info:      5   23[20]  ==( 4X 2.5 Gbps)==>
0x0008f10403970e20    1[  ] "tiger-node-0-11 HCA-1"
[root@tiger-node-0-1 ~]#

This is interesting because other sources state that my problem is
possibly related to an over-subscribed network, even though there is no
traffic on the network when these nodes hang. Are you familiar with
what settings need to be tweaked on a Voltaire IB switch (9550) to
possibly resolve this problem? Unfortunately, my knowledge of IB is
minimal, so any help is appreciated. Thanks!
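
To check whether XmtDiscards keeps growing while the fabric is idle, two
runs of ibqueryerrors.pl can be compared. The following is only a minimal
sketch in Python, not part of OFED or any Voltaire tool. The function
names are hypothetical, and the regular expressions assume that port lines
look exactly like the one pasted above, with each counter (if there are
several) in its own [Name == N] bracket.

```python
import re

# Assumed line format, taken from the output pasted above:
#   GUID 0x0008f104003f0e21 port 23: [XmtDiscards == 4]
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def parse_port_errors(text):
    """Return {(guid, port): {counter_name: value}} for every port line."""
    result = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # header, "Actions:" and "Link info:" lines are skipped
        guid, port, rest = match.group(1), int(match.group(2)), match.group(3)
        result[(guid, port)] = {
            name: int(value) for name, value in COUNTER.findall(rest)
        }
    return result


def counter_growth(before, after):
    """Return only the counters that increased between two snapshots."""
    growth = {}
    for key, counters in after.items():
        old = before.get(key, {})
        deltas = {
            name: value - old.get(name, 0)
            for name, value in counters.items()
            if value > old.get(name, 0)
        }
        if deltas:
            growth[key] = deltas
    return growth
```

Feeding the saved text of one run to parse_port_errors() and a later run
to the same function, then passing both results to counter_growth(), shows
which ports accumulated new discards in between. A counter that rises
while no jobs are running would point away from ordinary congestion.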

On Tue, Mar 31, 2009 at 10:48 AM, Brian J. Murrell
<Brian.Murrell at sun.com> wrote:
> On Tue, 2009-03-31 at 10:29 -0400, syed haider wrote:
>> > when a node hangs, it is unable to do an lctl ping to an OSS. For
>> > example, node-0-6 is hanging. From this node I can do an lctl ping to
>> > oss-0-0, oss-0-2 and oss-0-3. Lctl ping to oss-0-1 just hangs. And if I
>> > do the same from oss-0-1 to node-0-6 I get the following error message:
>> >
>> > [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
>> > failed to ping 192.255.255.220@o2ib: Input/output error
>
> That, together with all of the log messages you posted, looks an awful
> lot like networking problems.  You need to find some independent method
> of testing network connectivity when this happens.  I think there are
> tools in the OFED distribution to test I/B networks.  You need to make
> sure that whatever test/tool you use utilizes RDMA as there are several
> communications channels in an I/B connection and LNET uses the RDMA
> channel.  ICMP ping on an I/B network is not an indicator that LNET will
> be happy with that network.
>
> Just because a piece of networking gear fails to report any errors would
> not for a minute make me believe there are none.  Only an empirical test
> would do that for me.
>
> Cheers,
> b.