[Lustre-discuss] Client evictions and RDMA failures

Brian J. Murrell Brian.Murrell at Sun.COM
Tue Mar 31 07:48:08 PDT 2009


On Tue, 2009-03-31 at 10:29 -0400, syed haider wrote:
> > when a node hangs, it is unable to do an lctl ping to an OSS. For
> > example, node-0-6 is hanging. From this node I can do an lctl ping to
> > oss-0-0, oss-0-2 and oss-0-3. Lctl ping to oss-0-1 just hangs. And if
> > I do the same from oss-0-1 to node-0-6 I get the following error
> > message:
> >
> > [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
> > failed to ping 192.255.255.220@o2ib: Input/output error

That, together with all of the log messages you posted, looks an awful
lot like a networking problem.  You need to find some independent method
of testing network connectivity when this happens.  I think there are
tools in the OFED distribution to test I/B networks.  You need to make
sure that whatever test/tool you use utilizes RDMA, as there are several
communications channels in an I/B connection and LNET uses the RDMA
channel.  A successful ICMP ping on an I/B network is not an indicator
that LNET will be happy with that network.
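
One way to make such a test repeatable is to wrap an RDMA-level probe in a
time limit, so that a hang (like the lctl ping to oss-0-1) is reported
rather than blocking the shell.  The sketch below is only an illustration:
the probe and report_peers helpers are made up for this example, it assumes
GNU timeout is available, and it assumes the perftest tool ib_read_bw is
installed with its server side already started on each peer.

```sh
#!/bin/sh
# probe SECONDS COMMAND [ARGS...]
# Run COMMAND under a time limit and print ok, failed, or hung.
# (Hypothetical helper, not part of Lustre or OFED.)
probe() {
    secs=$1
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
    case $? in
        0)   echo ok ;;
        124) echo hung ;;      # GNU timeout exits 124 when the limit is hit
        *)   echo failed ;;    # includes "command not found"
    esac
}

# report_peers SECONDS PEER...
# Probe each peer with ib_read_bw, which moves data using RDMA reads.
# Assumption: an ib_read_bw server is already listening on each peer.
report_peers() {
    secs=$1
    shift
    for peer in "$@"; do
        echo "$peer: $(probe "$secs" ib_read_bw "$peer")"
    done
}
```

From the hanging client this might be run as
"report_peers 10 oss-0-0 oss-0-1 oss-0-2 oss-0-3", assuming those names
resolve on your network.  A peer that shows up as hung or failed here while
ICMP ping still succeeds would point at the RDMA path rather than at Lustre.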

The fact that a piece of networking gear fails to report any errors would
not for a minute make me believe there are none.  Only an empirical test
would do that for me.

Cheers,
b.
