[lustre-discuss] LNET IB intermittent connection

Colin Faber cfaber at gmail.com
Wed Feb 10 15:22:26 PST 2021


Hi Nathan,

Have you checked that the underlying InfiniBand fabric itself is
functioning correctly? The Mellanox InfiniBand management and monitoring
tools may help here:

https://www.mellanox.com/products/adapter-software/infiniband-management-and-monitoring-tools
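
As a quick first pass before running fabric-wide diagnostics, the error
counters of the local HCA ports can be read from sysfs. The following is a
minimal sketch, not a definitive health check. It assumes the in-box RDMA
stack exposes the standard counters under
/sys/class/infiniband/<device>/ports/<port>/counters/; the function name and
the choice of counters are illustrative, and some drivers may not provide
every file. It only sees the local ports, so switch-side errors still need
a tool such as ibqueryerrors from infiniband-diags.

```python
from pathlib import Path

# Error counters exposed by the Linux RDMA stack under
# /sys/class/infiniband/<device>/ports/<port>/counters/
# (this selection is an assumption about what is worth watching).
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_rcv_remote_physical_errors",
    "local_link_integrity_errors",
    "excessive_buffer_overrun_errors",
    "port_xmit_discards",
)


def nonzero_error_counters(root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for every nonzero error counter."""
    found = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        port = counters_dir.parent.name
        device = counters_dir.parent.parent.parent.name
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this HCA/driver
            if value:
                found[(device, port, name)] = value
    return found


if __name__ == "__main__":
    for (device, port, name), value in nonzero_error_counters().items():
        print(f"{device} port {port}: {name} = {value}")
```

Running this on a client and on the servers it loses contact with, then
comparing the values again after the next connection-lost event, shows
whether a counter such as symbol_error or link_downed is still climbing on
a particular port.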

-cf

On Wed, Feb 10, 2021 at 3:54 PM Nathan Crawford <nrcrawfo at uci.edu> wrote:

> Hi All,
>
>   I've recently been having a bunch of LNET over Infiniband
> connection-lost/-restored errors and am trying to find the cause and/or
> tune the system to better cope. There is a lot of stuff on the wiki (
> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
> but that's from 2016, and I don't know what parts are superseded. I'm
> currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel
> QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>
>   Is there a better place to look (e.g. the fine manual, section X) for
> guidance? I've done a few searches on the Jira, but the most similar errors
> should have already been fixed in earlier releases.
>
>   Assuming that there is actually some impending hardware issue, can LNET
> be easily configured to go over the @tcp connection when the @o2ib flakes
> out?
>
> Thanks,
> Nate
>
> --
>
> Dr. Nathan Crawford              nathan.crawford at uci.edu
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall                 Office: 2101 Natural Sciences II
> University of California, Irvine  Phone: 949-824-4508
> Irvine, CA 92697-2025, USA