[lustre-discuss] LNET IB intermittent connection

Nathan Crawford nrcrawfo at uci.edu
Thu Feb 11 15:32:26 PST 2021


Hi Colin,

  I've checked the performance/error counters and run the in-OS-repo
version of ibdiagnet. Apart from a couple of nodes with known failing
cables/HCAs (not involved in the LNET connection problems), the fabric was
healthy. It did pick up that the IPoIB partition was still at 20 Gbit/s from
when we had a couple of DDR connections, so increasing that to 40 Gbit/s may
help.

  The current suspect is that the ZFS pools under the OSTs recently got
much too close to capacity (>90%) and are taking longer to process I/O.
Is there a set of timeouts to increase or thresholds to loosen in order
to cope?
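
  For reference, this is roughly how the over-full pools can be flagged. It is
only a minimal sketch: it assumes the input looks like the output of
`zpool list -H -o name,capacity` (one pool per line, name then capacity), and
the function name, the 90% default, and the sample pool names are all made up
for illustration. It does not run zpool itself.

```python
def pools_over_threshold(zpool_output, threshold_pct=90):
    """Return (pool, capacity_pct) pairs strictly above threshold_pct.

    Assumes text shaped like `zpool list -H -o name,capacity`:
    one pool per line, whitespace-separated, capacity as "93%" or "93".
    """
    full = []
    for line in zpool_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, cap = fields
        try:
            pct = int(cap.rstrip("%"))
        except ValueError:
            continue  # capacity not numeric (e.g. "-"), ignore the line
        if pct > threshold_pct:
            full.append((name, pct))
    return full


# Hypothetical sample data, not output from the real system.
sample = "ost0pool\t93%\nost1pool\t88%\nost2pool\t91%\n"
for name, pct in pools_over_threshold(sample):
    print(f"{name}: {pct}% full")
```

  On a server, the text could come from subprocess.run() on the zpool command
above instead of the sample string.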

Thanks,
Nate

On Wed, Feb 10, 2021 at 3:24 PM Colin Faber <cfaber at gmail.com> wrote:

> Hi Nathan,
>
> Have you examined the underlying fabric to ensure it's functioning
> correctly?
>
>
> https://www.mellanox.com/products/adapter-software/infiniband-management-and-monitoring-tools
> might interest you
>
> -cf
>
> On Wed, Feb 10, 2021 at 3:54 PM Nathan Crawford <nrcrawfo at uci.edu> wrote:
>
>> Hi All,
>>
>>   I've recently been having a bunch of LNET over InfiniBand
>> connection-lost/-restored errors and am trying to find the cause and/or
>> tune the system to better cope. There is a lot of stuff on the wiki (
>> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
>> but that's from 2016, and I don't know what parts are superseded. I'm
>> currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel
>> QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>>
>>   Is there a better place to look (e.g. the fine manual, section X) for
>> guidance? I've done a few searches on the Jira, but the most similar errors
>> should have already been fixed in earlier releases.
>>
>>   Assuming that there is actually some impending hardware issue, can LNET
>> be easily configured to go over the @tcp connection when the @o2ib flakes
>> out?
>>
>> Thanks,
>> Nate
>>
>> --
>>
>> Dr. Nathan Crawford              nathan.crawford at uci.edu
>> Director of Scientific Computing
>> School of Physical Sciences
>> 164 Rowland Hall                 Office: 2101 Natural Sciences II
>> University of California, Irvine  Phone: 949-824-4508
>> Irvine, CA 92697-2025, USA
>>
>
