[lustre-discuss] Odd client behavior with mixed Lustre versions

Kevin M. Hildebrand kevin at umd.edu
Thu Jan 24 10:08:23 PST 2019


So I'm still experimenting with my 2.10.6 clients mounting from a 2.8
server.  I've found some more information that might narrow down the issue.

To recap:
When a client is rebooted, or after the IB modules are reloaded, the first
Lustre operation against each server takes a very long time to connect.
lctl ping hangs and times out for 30-60 seconds.  Once it makes a
successful connection, subsequent connections to the same server are fine.
So mounting the Lustre filesystem takes a long time, as it has to time out
against each MDS and each OSS before finally succeeding.

What's new:
If I first do an IPoIB ping of the server I'm trying to reach, the lctl
ping succeeds immediately.  So if I ping all of the MDSes and OSSes, the
filesystem will mount immediately.
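
For anyone who wants to try the same workaround, here is a rough sketch of
the pre-mount step described above: an IPoIB ping of each server, followed by
an lctl ping of its NID. It is only an illustration, not a fix for the
underlying problem. The helper names (ip_from_nid, prewarm_commands, prewarm)
are made up for this example, the server list is a placeholder with only the
one NID mentioned in this thread, and it assumes the IPoIB address is the IP
part of the NID and that ping is the Linux iputils version.

```python
import subprocess


def ip_from_nid(nid):
    """Return the IP part of an LNet NID such as '192.168.64.70@o2ib1'."""
    return nid.split("@", 1)[0]


def prewarm_commands(nids):
    """Build, for each server NID, an IPoIB ping followed by an lctl ping."""
    cmds = []
    for nid in nids:
        # One ICMP echo over IPoIB with a 2 second wait (iputils ping flags).
        cmds.append(["ping", "-c", "1", "-W", "2", ip_from_nid(nid)])
        # LNet-level ping of the same server.
        cmds.append(["lctl", "ping", nid])
    return cmds


def prewarm(nids, run=subprocess.run):
    """Run each command in order; map the command line to True on exit code 0."""
    results = {}
    for cmd in prewarm_commands(nids):
        results[" ".join(cmd)] = run(cmd).returncode == 0
    return results


if __name__ == "__main__":
    # Placeholder list: add the NID of every MDS and OSS for a real site.
    servers = ["192.168.64.70@o2ib1"]
    # Dry run: print the commands instead of executing them.
    for cmd in prewarm_commands(servers):
        print(" ".join(cmd))
```

Calling prewarm(servers) as root on a client before mounting would actually
run the commands and report which ones succeeded. The run argument is there
so the logic can be exercised without touching the network.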

Does this sound familiar to anyone?

Thanks,
Kevin



On Thu, Jan 10, 2019 at 4:23 PM Kevin M. Hildebrand <kevin at umd.edu> wrote:

> I've got a RHEL6 Lustre installation where the servers are running 2.8.0,
> that I'd prefer not to upgrade.
> We've been running 2.8.0 on RHEL6 clients as well and everything's been
> working fine.  However, I just updated the Linux release on the RHEL6
> clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
> kernel.  I've built and installed 2.10.6 on these clients, and the kernel
> modules load fine, but on first contact with any lustre server, I get a
> bunch of timeouts before I can get a valid connection.  The Lustre network
> in this case is Infiniband, using Mellanox OFED on the clients.
> 'lctl ping' hangs for a few seconds and returns 'failed to ping
> 192.168.64.70@o2ib1: Input/output error'.  An IPoIB ping of the server IP
> address works fine.
> At the same time I get a message in syslog that says 'LNet:
> 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> 192.168.64.70@o2ib1: 4296292 seconds'
> Nothing shows up in the logs on the server side.
>
> If I repeat the 'lctl ping' a few times, after 30-60 seconds or so, 'lctl
> ping' succeeds.
> This happens for each of my Lustre servers, and once I get a successful
> ping back, it seems to be fully functional up until the next reboot, or
> until the Infiniband modules are reloaded.
>
> If I try to mount the filesystem without doing the pings, I'll get
> timeouts contacting the MDS for the same 30-60 seconds, and then once the
> MDSes are reachable, I get timeouts to the OSSes for a while, until they
> become reachable, and once they're all talking, all seems to be fine.
>
> Any ideas on what could be wrong?
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland
>