[lustre-discuss] Odd client behavior with mixed Lustre versions

Kevin M. Hildebrand kevin at umd.edu
Thu Jan 10 13:23:11 PST 2019


I've got a RHEL6 Lustre installation whose servers are running 2.8.0, and
I'd prefer not to upgrade them.
We've been running 2.8.0 on RHEL6 clients as well and everything's been
working fine.  However, I just updated the Linux release on the RHEL6
clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
kernel.  I've built and installed 2.10.6 on these clients, and the kernel
modules load fine, but on first contact with any Lustre server, I get a
bunch of timeouts before I can get a valid connection.  The Lustre network
in this case is InfiniBand, using Mellanox OFED on the clients.
'lctl ping' hangs for a few seconds and returns 'failed to ping
192.168.64.70@o2ib1: Input/output error'.  An IPoIB ping of the server IP
address works fine.
At the same time I get a message in syslog that says 'LNet:
8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
192.168.64.70@o2ib1: 4296292 seconds'
Nothing shows up in the logs on the server side.

If I repeat the 'lctl ping' a few times, after 30-60 seconds or so, 'lctl
ping' succeeds.
This happens for each of my Lustre servers, and once I get a successful
ping back, it seems to be fully functional up until the next reboot, or
until the InfiniBand modules are reloaded.
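
For anyone who wants to reproduce this or measure how long the delay is, the
repeated pings described above can be scripted.  This is only a minimal
sketch, not something from my actual setup: the function name
ping_until_reachable and the parameters timeout_s and interval_s are made up
for illustration, and it assumes 'lctl ping <nid>' exits with a non-zero
status when the ping fails.

```python
import subprocess
import time


def ping_until_reachable(nid, timeout_s=120.0, interval_s=5.0,
                         run=subprocess.run, clock=time.monotonic,
                         sleep=time.sleep):
    """Repeat `lctl ping <nid>` until it succeeds or timeout_s elapses.

    Returns the seconds elapsed before the first successful ping, or None
    if no ping succeeded within timeout_s.  Assumption: `lctl ping` exits
    non-zero on failure.  run/clock/sleep are injectable so the loop can
    be exercised without a Lustre client.
    """
    start = clock()
    while True:
        result = run(["lctl", "ping", nid],
                     stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL)
        elapsed = clock() - start
        if result.returncode == 0:
            return elapsed
        if elapsed >= timeout_s:
            return None
        # Wait before the next attempt; each failed ping already blocks
        # for a few seconds on its own.
        sleep(interval_s)
```

Calling ping_until_reachable("192.168.64.70@o2ib1") once per server NID
before mounting would show how long each server takes to become reachable
after a reboot or module reload.  It has to be run as root on a client with
the Lustre modules loaded.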

If I try to mount the filesystem without doing the pings, I'll get timeouts
contacting the MDS for the same 30-60 seconds, and then once the MDSes are
reachable, I get timeouts to the OSSes for a while, until they become
reachable, and once they're all talking, all seems to be fine.

Any ideas on what could be wrong?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland