[lustre-discuss] Odd client behavior with mixed Lustre versions

Kevin M. Hildebrand kevin at umd.edu
Tue Jan 15 04:49:10 PST 2019


Yeah, I thought about that.  Both the client and servers are using the
defaults for ko2iblnd:

          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
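
For anyone who wants to repeat the comparison rather than eyeball it, here is
a minimal sketch of how two such listings could be diffed.  It assumes the
listing above has been saved from each host as plain text in the same
"name: value" layout; parse_tunables and diff_tunables are hypothetical helper
names for illustration, not part of Lustre.

```python
def parse_tunables(text):
    """Parse 'name: value' lines into a dict of strings.

    Section headers such as 'tunables:' have no value and are skipped.
    """
    params = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value:
            params[key.strip()] = value
    return params


def diff_tunables(client, server):
    """Return {name: (client_value, server_value)} for every mismatch.

    A name missing on one side shows up with None for that side.
    """
    names = sorted(set(client) | set(server))
    return {
        name: (client.get(name), server.get(name))
        for name in names
        if client.get(name) != server.get(name)
    }


if __name__ == "__main__":
    # Values copied from the listing in this message; per the message,
    # the servers report the same defaults, so the same text is used twice.
    listing = """
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
    """
    client = parse_tunables(listing)
    server = parse_tunables(listing)
    print(diff_tunables(client, server) or "no differences")
```

Values are compared as strings, so the sketch only flags a parameter when the
two listings actually print it differently.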

Thanks,
Kevin

On Fri, Jan 11, 2019 at 5:17 PM Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote:

> Is it possible you have some incompatible ko2iblnd module parameters
> between the 2.8 servers and the 2.10 clients?  If there was something
> causing LNet issues, that could possibly explain some of the symptoms you
> are seeing.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
> > On Jan 10, 2019, at 4:23 PM, Kevin M. Hildebrand <kevin at umd.edu> wrote:
> >
> > I've got a RHEL6 Lustre installation where the servers are running
> > 2.8.0, which I'd prefer not to upgrade.
> > We've been running 2.8.0 on RHEL6 clients as well and everything's been
> > working fine.  However, I just updated the Linux release on the RHEL6
> > clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
> > kernel.  I've built and installed 2.10.6 on these clients, and the kernel
> > modules load fine, but on first contact with any Lustre server, I get a
> > bunch of timeouts before I can get a valid connection.  The Lustre network
> > in this case is InfiniBand, using Mellanox OFED on the clients.
> > 'lctl ping' hangs for a few seconds and returns 'failed to ping
> > 192.168.64.70@o2ib1: Input/output error'.  An IPoIB ping of the server IP
> > address works fine.
> > At the same time I get a message in syslog that says 'LNet:
> > 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> > 192.168.64.70@o2ib1: 4296292 seconds'
> > Nothing shows up in the logs on the server side.
> >
> > If I repeat the 'lctl ping' a few times, after 30-60 seconds or so,
> > 'lctl ping' succeeds.
> > This happens for each of my Lustre servers, and once I get a successful
> > ping back, it seems to be fully functional up until the next reboot, or
> > until the InfiniBand modules are reloaded.
> >
> > If I try to mount the filesystem without doing the pings, I'll get
> > timeouts contacting the MDS for the same 30-60 seconds, and then once the
> > MDSes are reachable, I get timeouts to the OSSes for a while, until they
> > become reachable, and once they're all talking, all seems to be fine.
> >
> > Any ideas on what could be wrong?
> >
> > Thanks,
> > Kevin
> >
> > --
> > Kevin Hildebrand
> > University of Maryland