[lustre-discuss] Odd client behavior with mixed Lustre versions
Kevin M. Hildebrand
kevin at umd.edu
Thu Jan 24 10:08:23 PST 2019
So I'm still experimenting with my 2.10.6 clients mounting from a 2.8
server. I've found some more information that might narrow down the issue.
When a client is rebooted, or after the IB modules are reloaded, the first
Lustre operation against each server takes a very long time to connect.
lctl ping hangs and times out repeatedly for 30-60 seconds. Once it makes a
successful connection, subsequent connections to the same server are fine.
So mounting the Lustre filesystem takes a long time, because it has to time
out against each MDS and each OSS before finally succeeding.
If I first do an IPoIB ping of the server I'm trying to reach, the lctl
ping succeeds immediately. So if I ping all of the MDSes and OSSes
beforehand, the filesystem will mount immediately.
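For anyone who wants to reproduce the workaround, here is a rough sketch of
the sequence described above: one IPoIB ping per server, followed by an lctl
ping of the same NID. The address 192.168.64.70 and the network o2ib1 come
from the earlier message; the function names, the ping options, and the
assumption that lctl exits non-zero when the ping fails are illustrative and
not something verified on this cluster.

```python
import subprocess

LNET = "o2ib1"               # LNet network name from the earlier message
SERVERS = ["192.168.64.70"]  # only address given in the thread; add the other MDS/OSS IPoIB addresses


def prewarm_commands(servers, lnet=LNET):
    """Return one (ip_ping, lctl_ping) pair of argv lists per server."""
    pairs = []
    for ip in servers:
        ip_ping = ["ping", "-c", "1", "-W", "2", ip]   # single ICMP echo over IPoIB, 2 s wait
        lctl_ping = ["lctl", "ping", f"{ip}@{lnet}"]   # LNet-level ping of the matching NID
        pairs.append((ip_ping, lctl_ping))
    return pairs


def prewarm(servers, lnet=LNET, run=subprocess.run):
    """Ping each server over IPoIB, then over LNet; return the NIDs whose lctl ping failed."""
    failed = []
    for ip_ping, lctl_ping in prewarm_commands(servers, lnet):
        # The IP ping result is ignored: it is issued only because doing it
        # first is what made the lctl ping succeed in the observation above.
        run(ip_ping, check=False)
        result = run(lctl_ping, check=False)
        if result.returncode != 0:   # assumption: non-zero exit means the LNet ping failed
            failed.append(lctl_ping[-1])
    return failed
```

Calling prewarm(SERVERS) as root before the mount, and mounting only if it
returns an empty list, would mirror what I have been doing by hand. The run
parameter exists only so the command sequence can be checked without touching
the network.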
Does this sound familiar to anyone?
On Thu, Jan 10, 2019 at 4:23 PM Kevin M. Hildebrand <kevin at umd.edu> wrote:
> I've got a RHEL6 Lustre installation where the servers are running 2.8.0,
> that I'd prefer not to upgrade.
> We've been running 2.8.0 on RHEL6 clients as well and everything's been
> working fine. However, I just updated the Linux release on the RHEL6
> clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
> kernel. I've built and installed 2.10.6 on these clients, and the kernel
> modules load fine, but on first contact with any Lustre server, I get a
> bunch of timeouts before I can get a valid connection. The Lustre network
> in this case is InfiniBand, using Mellanox OFED on the clients.
> 'lctl ping' hangs for a few seconds and returns 'failed to ping
> 192.168.64.70@o2ib1: Input/output error'. An IPoIB ping of the server IP
> address works fine.
> At the same time I get a message in syslog that says 'LNet:
> 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> 192.168.64.70@o2ib1: 4296292 seconds'
> Nothing shows up in the logs on the server side.
> If I repeat the 'lctl ping' a few times, after 30-60 seconds or so, 'lctl
> ping' succeeds.
> This happens for each of my Lustre servers, and once I get a successful
> ping back, it seems to be fully functional up until the next reboot, or
> until the InfiniBand modules are reloaded.
> If I try to mount the filesystem without doing the pings, I'll get
> timeouts contacting the MDS for the same 30-60 seconds, and then once the
> MDSes are reachable, I get timeouts to the OSSes for a while, until they
> become reachable, and once they're all talking, all seems to be fine.
> Any ideas on what could be wrong?
> Kevin Hildebrand
> University of Maryland