[lustre-discuss] client failing off network
Horn, Chris
chris.horn at hpe.com
Thu Oct 30 12:33:53 PDT 2025
As a further troubleshooting step, I would suggest enabling neterror in the printk mask on the client and LNet routers:
lctl set_param printk=+neterror
This may surface additional information about why the routes are being marked down.
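For example (the output file path here is just a placeholder), you could verify the mask took effect and capture the kernel debug buffer the next time the routes drop:
lctl get_param printk          # confirm neterror is now in the console log mask
lctl dk /tmp/lnet-debug.txt    # dump the Lustre kernel debug log to a file for inspection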
Another thing to try is checking connectivity between the client and the routers after the routes get marked down. Do pings over the LNet interface work?
ping -I <client_ip> <router_ip>
lnetctl ping --source <client_nid> <router_nid>
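It can also be useful to compare route and peer state before and after the event. Assuming a reasonably recent 2.15 release, something like the following should show whether each gateway is currently considered up or down (the NID is a placeholder, as above):
lnetctl route show -v                      # verbose route listing, including route state
lnetctl peer show --nid <router_nid> -v    # per-NI health and send/recv counters for that gateway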
There were only a handful of LNet changes between 2.15.6 and 2.15.7, so a regression in LNet seems unlikely:
> git log --oneline 2.15.7 ^2.15.6 -- lnet
17fc6dbcd6 LU-17784 build: improve wiretest for flexible arrays
8535cfe29a LU-18572 lnet: Uninitialized var in lnet_peer_add
c00bb50624 LU-18697 lnet: lnet_peer_del_nid refcount loss
9d8dbed27c LU-16594 build: get_random_u32_below, get_acl with dentry
247ae64877 LU-17081 build: compatibility for 6.5 kernels
>
Chris Horn
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Michael DiDomenico via lustre-discuss <lustre-discuss at lists.lustre.org>
Date: Thursday, October 30, 2025 at 2:08 PM
To: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] client failing off network
Our network is running 2.15.6 everywhere on RHEL 9.5. We recently built
a new machine using 2.15.7 on RHEL 9.6, and I'm seeing a strange
problem. The client is Ethernet-connected to ten LNet routers which
bridge Ethernet to InfiniBand.
I can mount the client just fine and read/write data, but several
hours later the client marks all the routers offline. The only
recovery is to lazy-unmount, run lustre_rmmod, and then restart the
Lustre mount.
Nothing unusual comes out in the journal/dmesg logs. To Lustre it
"looks" like someone pulled the network cable, but there's no evidence
that this has happened physically or even at the switch/software
layer.
We upgraded two other machines to see if the problem replicates, but so
far it hasn't. The only significant difference between the three
machines is that the one with the problem has heavy container (podman)
usage; the others have zero. I'm not sure if this is a cause or just
a red herring.
Any suggestions?