[lustre-discuss] LNet nid down after something changed the NICs

Horn, Chris chris.horn at hpe.com
Wed Mar 1 09:15:26 PST 2023


Hi CJ,

I don’t know if you ever got an account and ticket opened, but I stumbled upon this change, which sounds like it could be your issue: https://jira.whamcloud.com/browse/LU-16378

commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
Author: Cyril Bordage <cbordage at whamcloud.com>
Date:   Sat Dec 10 01:51:16 2022 +0100

    LU-16378 lnet: handles unregister/register events

    When network is restarted, devices are unregistered and then
    registered again. When a device registers using an index that is
    different from the previous one (before network was restarted), LNet
    ignores it. Consequently, this device stays with link in fatal state.

    To fix that, we catch unregistering events to clear the saved index
    value, and when a registering event comes, we save the new value.
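
For readers who want a feel for what such a fix looks like, below is a minimal, hypothetical C sketch of the general pattern: a netdevice notifier that clears a saved ifindex on NETDEV_UNREGISTER and records the new one on NETDEV_REGISTER. This is not the actual LU-16378 patch; names such as my_ni_state are invented for illustration.

/*
 * Hypothetical sketch only; not the actual LU-16378 patch.
 */
#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <linux/string.h>

struct my_ni_state {
        char ifname[IFNAMSIZ];  /* interface this NI is bound to */
        int  ifindex;           /* last known ifindex, -1 if unknown */
};

static struct my_ni_state ni_state = { .ifname = "eth0", .ifindex = -1 };

static int my_netdev_event(struct notifier_block *nb,
                           unsigned long event, void *ptr)
{
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);

        if (strcmp(dev->name, ni_state.ifname) != 0)
                return NOTIFY_DONE;

        switch (event) {
        case NETDEV_UNREGISTER:
                /* Device is going away: forget the saved index so a later
                 * re-registration with a different index is not ignored. */
                ni_state.ifindex = -1;
                break;
        case NETDEV_REGISTER:
                /* Device re-registered, possibly with a new index: save
                 * the new value rather than comparing with the stale one. */
                ni_state.ifindex = dev->ifindex;
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block my_netdev_nb = {
        .notifier_call = my_netdev_event,
};

/* Call register_netdevice_notifier(&my_netdev_nb) at module init and
 * unregister_netdevice_notifier(&my_netdev_nb) at module exit. */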

Chris Horn

From: CJ Yin <woshifuxiuyin at gmail.com>
Date: Sunday, February 19, 2023 at 12:23 AM
To: Horn, Chris <chris.horn at hpe.com>
Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] LNet nid down after something changed the NICs
Hi Chris,

Thanks for your help. I have collected the relevant logs following your hints, but I need an account to open a ticket on Jira. I have sent an email to the administrator at info at whamcloud.com; is this the correct way to apply for an account? That was the only contact email I found on the site.

Regards,
Chuanjun

Horn, Chris <chris.horn at hpe.com> wrote on Sat, Feb 18, 2023 at 00:52:
If deleting and re-adding it restores the status to up, then this sounds like a bug to me.

Can you enable debug tracing, reproduce the issue, and add this information to a ticket?

To enable/gather debug:

# lctl set_param debug=+net
<reproduce issue>
# lctl dk > /tmp/dk.log
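
If you want the dump to cover only the relevant window, you can optionally empty the debug buffer right before reproducing, and restore the default mask when done:

# lctl clear
<reproduce issue, then lctl dk as above>
# lctl set_param debug=-net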

You can create a ticket at https://jira.whamcloud.com/

Please provide the dk.log with the ticket.

Thanks,
Chris Horn

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of 腐朽银 via lustre-discuss <lustre-discuss at lists.lustre.org>
Date: Friday, February 17, 2023 at 2:53 AM
To: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] LNet nid down after something changed the NICs
Hi,

I encountered a problem when using the Lustre client on Kubernetes (k8s) with kubenet. I would be very happy if you could help me.

My LNet configuration, as shown by `lnetctl net show`, is:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: up
          interfaces:
              0: eth0
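
For context, a configuration like this is typically created with something along the lines of:

# lnetctl net add --net tcp --if eth0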

It works, but after I deploy or delete a pod on the node, the NID goes down:

        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0

k8s uses veth pairs, so it adds or deletes network interfaces whenever pods are deployed or deleted, but it does not touch the eth0 NIC itself. I can fix the status by deleting the tcp net with `lnetctl net del` and re-adding it with `lnetctl net add` (see the exact commands below), but I need to do this every time a pod is scheduled to this node.
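
Spelled out, the workaround looks like this (assuming the single tcp net bound to eth0 shown above):

# lnetctl net del --net tcp
# lnetctl net add --net tcp --if eth0

The veth register/unregister events that accompany each pod change can be observed with `ip monitor link`, which may help correlate them with the status flip.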

My node OS is Ubuntu 18.04 with kernel 5.4.0-1101-azure. The Lustre client is built by myself from 2.15.1. Is this expected LNet behavior, or did I get something wrong? I rebuilt and tested it several times and got the same problem.

Regards,
Chuanjun