[lustre-discuss] LNET IB intermittent connection

Nathan Crawford nrcrawfo at uci.edu
Wed Feb 10 14:53:35 PST 2021


Hi All,

  I've recently been seeing a lot of LNET over InfiniBand connection-lost
and connection-restored errors, and am trying to find the cause and/or tune
the system to cope better. There is a lot of material on the wiki
(https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
but it dates from 2016, and I don't know which parts have been superseded.
I'm currently running Lustre 2.12.5 on CentOS 7.8, with a mix of QLogic/Intel
QDR and Mellanox EDR HCAs and switches (using the CentOS in-box RDMA stack
and opensm).

  Is there a better place to look (e.g. the fine manual, section X) for
guidance? I've done a few searches on the Lustre Jira, but the most similar
errors I found should already have been fixed in earlier releases.

  Assuming that there is actually some impending hardware issue, can LNET
be easily configured to go over the @tcp connection when the @o2ib flakes
out?

Thanks,
Nate

-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall                 Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA