[lustre-discuss] Lnet ping issue

Kurt Strosahl strosahl at jlab.org
Mon Jul 27 13:08:44 PDT 2020


Good Afternoon,

I'm experiencing an odd issue with one of my Lustre clients. The system seems to be having trouble talking to one of the OSS systems, and when it reboots it somehow mounts Lustre twice. Attempts to use lctl ping from the client to the OSS return the following error:

~] lctl ping 172.17.0.98@o2ib
failed to ping 172.17.0.98@o2ib: Input/output error

Conventional ping works.

When I try to ping from the OSS side, the lctl ping command hangs indefinitely. Looking in dmesg I see the following:
[17291774.980764] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17292374.970610] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17292974.961746] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17293602.500931] LNet: 174596:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17294234.941320] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion

A further oddity is that mounting the Lustre area seems to generate a double mount: when I unmount it by hand I have to do it twice to get it to unmount, and it shows up twice in /proc/mounts.
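
For anyone wanting to check for the same symptom, here is a minimal sketch of how the duplicate entries could be listed. The function name dup_lustre_mounts is made up for illustration, and it assumes the client mounts appear in /proc/mounts with the filesystem type "lustre" in the third field:

```shell
# Print each Lustre mount (device and mount point) that is listed more than once,
# prefixed by the number of times it appears.
# Reads /proc/mounts by default; pass another file to check saved output.
dup_lustre_mounts() {
    awk '$3 == "lustre" { seen[$1 " " $2]++ }
         END { for (m in seen) if (seen[m] > 1) print seen[m], m }' "${1:-/proc/mounts}"
}
```

Running dup_lustre_mounts with no arguments on the client should print nothing when every Lustre file system is mounted once, and one line per duplicated mount otherwise.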

The client is running the following:
CentOS Linux release 7.3.1611 (Core)
kernel: 3.10.0-514.el7.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64

It has a QDR InfiniBand interface.

The OSS has the following:
CentOS Linux release 7.6.1810 (Core)
kernel: 3.10.0-957.10.1.el7_lustre.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64
It has an FDR InfiniBand interface.

Cables for the client have been swapped, and different QDR switches have been used.

The client needs to stay at that version of Lustre so it can connect to another, older Lustre file system.

Thank you,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility