[lustre-discuss] LNET IB intermittent connection

Nathan Crawford nrcrawfo at uci.edu
Thu Feb 11 15:55:42 PST 2021


Hi Chris and Cory,

  I remember looking at configuring multi-rail when 2.12 came out for this
very reason, but stopped when it looked like round-robin only. Is there a
way to trick the LNet Health system into seeing one interface as "sick but
not dead"?

  Also, when is 2.14 coming out :)

  For what it's worth, the client errors I'm trying to diagnose (only one
client has them) are similar to:
[Thu Feb 11 15:51:24 2021] LustreError: 11-0:
DFS-L-OST0003-osc-ffff9cd07c339000: operation ost_set_info to node
10.201.32.48@o2ib1 failed: rc = -107
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000:
Connection to DFS-L-OST0003 (at 10.201.32.48@o2ib1) was lost; in progress
operations using this service will wait for recovery to complete
[Thu Feb 11 15:51:24 2021] LustreError: 167-0:
DFS-L-OST0003-osc-ffff9cd07c339000: This client was evicted by
DFS-L-OST0003; in progress operations using this service will fail.
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000:
Connection restored to 10.201.32.48@o2ib1 (at 10.201.32.48@o2ib1)
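
  To see whether these events cluster on particular targets, something
like the rough Python sketch below could tally them per OST. It is only an
illustration, not a Lustre tool: it assumes each dmesg message sits on a
single unwrapped line with the NID in the usual address@network form, and
the names tally, classify, and the event labels are made up for this
example.

```python
import re
from collections import Counter

# A Lustre target name such as DFS-L-OST0003 (filesystem name, then OST/MDT index).
TARGET_RE = re.compile(r"\b([\w-]+-(?:OST|MDT)[0-9a-fA-F]{4})\b")

# Substrings taken from the messages quoted above, mapped to short labels.
# The labels are hypothetical names chosen for this sketch.
EVENT_MARKERS = [
    ("failed: rc =", "rpc_failed"),      # e.g. rc = -107, which is -ENOTCONN on Linux
    ("was lost", "lost"),
    ("was evicted by", "evicted"),
    ("Connection restored", "restored"),
]

# The four messages from the excerpt, unwrapped to one line each.
SAMPLE = [
    "[Thu Feb 11 15:51:24 2021] LustreError: 11-0: "
    "DFS-L-OST0003-osc-ffff9cd07c339000: operation ost_set_info to node "
    "10.201.32.48@o2ib1 failed: rc = -107",
    "[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "Connection to DFS-L-OST0003 (at 10.201.32.48@o2ib1) was lost; in progress "
    "operations using this service will wait for recovery to complete",
    "[Thu Feb 11 15:51:24 2021] LustreError: 167-0: "
    "DFS-L-OST0003-osc-ffff9cd07c339000: This client was evicted by "
    "DFS-L-OST0003; in progress operations using this service will fail.",
    "[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "Connection restored to 10.201.32.48@o2ib1 (at 10.201.32.48@o2ib1)",
]


def classify(line):
    """Return the label of the first marker found in the line, or None."""
    for marker, label in EVENT_MARKERS:
        if marker in line:
            return label
    return None


def tally(lines):
    """Count (target, event label) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        target = TARGET_RE.search(line)
        if label and target:
            counts[(target.group(1), label)] += 1
    return counts


if __name__ == "__main__":
    for (target, label), n in sorted(tally(SAMPLE).items()):
        print(target, label, n)
```

Feeding it the full output of dmesg instead of SAMPLE would show whether
the evictions stay on one OSS or are spread across all of them.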

Thanks,
Nate

On Thu, Feb 11, 2021 at 1:25 PM Horn, Chris <chris.horn at hpe.com> wrote:

> FYI, multi-rail in 2.12 will round robin traffic between both @tcp and
> @o2ib networks. If @o2ib flakes out then traffic should shift entirely to
> @tcp, but there isn’t a way to specify that traffic go to @tcp only when
> there’s a problem with @o2ib. You need the user defined selection policy
> feature for that, and that feature is not slated to arrive until after 2.14
> (afaik).
>
> Chris Horn
>
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of "Spitz, Cory James" <cory.spitz at hpe.com>
> Date: Thursday, February 11, 2021 at 3:17 PM
> To: "nathan.crawford at uci.edu" <nathan.crawford at uci.edu>, Lustre User
> Discussion Mailing List <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] LNET IB intermittent connection
> Resent-From: <hornc at cray.com>
> Resent-Date: Thursday, February 11, 2021 at 3:17 PM
>
> Hi, Nate.
>
> You asked, “can LNET be easily configured to go over the @tcp connection
> when the @o2ib flakes out?”
>
> Yes, you can use LNet Multi-Rail for that, and it *is* covered in the
> “fine manual”, chapter 16 ☺
>
> https://doc.lustre.org/lustre_manual.xhtml#lnetmr
>
> -Cory
>
> On 2/10/21, 4:54 PM, "lustre-discuss" <
> lustre-discuss-bounces at lists.lustre.org> wrote:
>
> Hi All,
>
>   I've recently been having a bunch of LNET over Infiniband
> connection-lost/-restored errors and am trying to find the cause and/or
> tune the system to better cope. There is a lot of stuff on the wiki (
> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
> but that's from 2016, and I don't know what parts are superseded. I'm
> currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel
> QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>
>   Is there a better place to look (e.g. the fine manual, section X) for
> guidance? I've done a few searches on the Jira, but the most similar errors
> should have already been fixed in earlier releases.
>
>   Assuming that there is actually some impending hardware issue, can LNET
> be easily configured to go over the @tcp connection when the @o2ib flakes
> out?
>
> Thanks,
> Nate
>
> --
> Dr. Nathan Crawford              nathan.crawford at uci.edu
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall                 Office: 2101 Natural Sciences II
> University of California, Irvine  Phone: 949-824-4508
> Irvine, CA 92697-2025, USA

-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall                 Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA