[lustre-discuss] LNet nid down after something changed the NICs
Horn, Chris
chris.horn at hpe.com
Wed Mar 1 09:15:26 PST 2023
Hi CJ,
I don’t know if you ever got an account and ticket opened, but I stumbled upon this change which sounds like it could be your issue - https://jira.whamcloud.com/browse/LU-16378
commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
Author: Cyril Bordage <cbordage at whamcloud.com>
Date: Sat Dec 10 01:51:16 2022 +0100
LU-16378 lnet: handles unregister/register events
When network is restarted, devices are unregistered and then
registered again. When a device registers using an index that is
different from the previous one (before network was restarted), LNet
ignores it. Consequently, this device stays with link in fatal state.
To fix that, we catch unregistering events to clear the saved index
value, and when a registering event comes, we save the new value.
Chris Horn
From: CJ Yin <woshifuxiuyin at gmail.com>
Date: Sunday, February 19, 2023 at 12:23 AM
To: Horn, Chris <chris.horn at hpe.com>
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] LNet nid down after something changed the NICs
Hi Chris,
Thanks for your help. I have collected the relevant logs following your hints, but I need an account to open a ticket on Jira. I have sent an email to the administrator at info at whamcloud.com; I was wondering if this is the correct way to apply for an account, as that is the only address I found on the site.
Regards,
Chuanjun
Horn, Chris <chris.horn at hpe.com> wrote on Sat, Feb 18, 2023 at 00:52:
If deleting and re-adding it restores the status to up then this sounds like a bug to me.
Can you enable debug tracing, reproduce the issue, and add this information to a ticket?
To enable/gather debug:
# lctl set_param debug=+net
<reproduce issue>
# lctl dk > /tmp/dk.log
You can create a ticket at https://jira.whamcloud.com/
Please provide the dk.log with the ticket.
Thanks,
Chris Horn
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of 腐朽银 via lustre-discuss <lustre-discuss at lists.lustre.org>
Date: Friday, February 17, 2023 at 2:53 AM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] LNet nid down after something changed the NICs
Hi,
I encountered a problem when using the Lustre client on Kubernetes with kubenet. I would be very happy if you could help me.
My LNet configuration is:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: up
          interfaces:
              0: eth0
This works, but after I deploy or delete a pod on the node, the nid goes down:
        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0
Kubernetes uses veth pairs, so it adds or deletes network interfaces when pods are deployed or deleted, but it doesn't touch the eth0 NIC. I can fix the problem by deleting the tcp net with `lnetctl net del` and re-adding it with `lnetctl net add`, but I need to do this every time a pod is scheduled to this node.
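The del/re-add workaround above can be scripted. This is a minimal sketch, assuming the `tcp` network and `eth0` interface from the configuration earlier in the thread; by default it only prints the commands (DRY_RUN=1), since `lnetctl` requires the LNet kernel modules to be loaded.

```shell
#!/bin/sh
# Sketch of the manual workaround: delete and re-add the tcp net so
# LNet re-probes the interface. NET and IF are taken from the poster's
# configuration; adjust for other setups.
NET="tcp"
IF="eth0"

readd_net() {
    for cmd in \
        "lnetctl net del --net $NET" \
        "lnetctl net add --net $NET --if $IF" \
        "lnetctl net show --net $NET"
    do
        echo "+ $cmd"
        # Only execute when DRY_RUN is explicitly disabled.
        [ "${DRY_RUN:-1}" = "1" ] || $cmd
    done
}

readd_net
```

This is a stopgap, not a fix; the LU-16378 patch mentioned later in the thread addresses the underlying interface re-registration issue.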
My node OS is Ubuntu 18.04 with kernel 5.4.0-1101-azure. I built the Lustre client myself from 2.15.1. Is this expected LNet behavior, or did I get something wrong? I rebuilt and tested it several times and got the same result.
Regards,
Chuanjun