[lustre-discuss] LNET IB intermittent connection
Horn, Chris
chris.horn at hpe.com
Thu Feb 11 13:34:54 PST 2021
FYI, Multi-Rail in 2.12 will round-robin traffic across both the @tcp and @o2ib networks (assuming peers are reachable on both). If @o2ib flakes out then traffic should shift entirely to @tcp, but there isn’t a way to specify that traffic should go to @tcp only when there’s a problem with @o2ib. You need the user defined selection policy feature for that, and that feature is not slated to arrive until after 2.14 (afaik).
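To make the distinction concrete, here is a toy model in Python of the behaviour described above. It is only a sketch: the function name `pick_networks` and the idea of passing in a list of "healthy" networks are made up for illustration and are not LNet's actual selection code or API. It just shows that with both networks up the traffic alternates, and that @tcp carries everything only once @o2ib drops out of the healthy set.

```python
from itertools import cycle


def pick_networks(healthy, n_messages):
    """Toy model: round-robin n_messages across the networks marked healthy.

    `healthy` is a hypothetical list of network names; real LNet tracks
    interface health internally and does not expose a function like this.
    """
    if not healthy:
        raise RuntimeError("no healthy network to send on")
    rr = cycle(healthy)
    return [next(rr) for _ in range(n_messages)]


# Both networks reachable: traffic is shared, not "o2ib first, tcp as backup".
both_up = pick_networks(["o2ib", "tcp"], 4)

# o2ib has flaked out: everything shifts to tcp.
ib_down = pick_networks(["tcp"], 4)

print(both_up)
print(ib_down)
```

What the toy model cannot express is "prefer @o2ib and touch @tcp only on failure"; that ordering is what the user defined selection policy feature would add.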
Chris Horn
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of "Spitz, Cory James" <cory.spitz at hpe.com>
Date: Thursday, February 11, 2021 at 3:17 PM
To: "nathan.crawford at uci.edu" <nathan.crawford at uci.edu>, Lustre User Discussion Mailing List <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] LNET IB intermittent connection
Resent-From: <hornc at cray.com>
Resent-Date: Thursday, February 11, 2021 at 3:17 PM
Hi, Nate.
You asked, “can LNET be easily configured to go over the @tcp connection when the @o2ib flakes out?”
Yes, you can use LNet Multi-Rail for it and that _is_ covered in the “fine manual”, chapter 16 ☺
https://doc.lustre.org/lustre_manual.xhtml#lnetmr
-Cory
On 2/10/21, 4:54 PM, "lustre-discuss" <lustre-discuss-bounces at lists.lustre.org> wrote:
Hi All,
I've recently been having a bunch of LNET over InfiniBand connection-lost/-restored errors and am trying to find the cause and/or tune the system to better cope. There is a lot of stuff on the wiki ( https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency ), but that's from 2016, and I don't know what parts are superseded. I'm currently running Lustre 2.12.5 on CentOS 7.8, with a mix of QLogic/Intel QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
Is there a better place to look (e.g. the fine manual, section X) for guidance? I've done a few searches on the Jira, but the most similar errors should have already been fixed in earlier releases.
Assuming that there is actually some impending hardware issue, can LNET be easily configured to go over the @tcp connection when the @o2ib flakes out?
Thanks,
Nate
--
Dr. Nathan Crawford nathan.crawford at uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA