[lustre-discuss] client failing off network

Michael DiDomenico mdidomenico4 at gmail.com
Thu Oct 30 12:43:07 PDT 2025


thanks i'll see if the printk turns up anything

no, lnet pings from client->router don't work once the client is
knocked offline, but regular ping over ethernet works fine.  and for
the record, the lnet routers are not going down; i have hundreds of
other machines connected without issue
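
to pin down when lnet ping stops working while regular ping still
works, a probe along these lines could be run from cron or a timer.
this is only a rough sketch: ROUTER_NIDS and the nid_ip/probe helper
names are made up for illustration, and it assumes lnetctl ping exits
non-zero on failure.

```shell
#!/bin/sh
# Hypothetical one-pass probe: compare ICMP ping vs LNet ping per router.
# ROUTER_NIDS is an assumed env var, e.g. "10.0.0.1@tcp 10.0.0.2@tcp".

# Strip the "@<net>" suffix from a NID to get the bare IP address.
nid_ip() {
    printf '%s\n' "${1%@*}"
}

# Run a command quietly and print a timestamped ok/FAIL line for it.
# usage: probe <label> <command> [args...]
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        status=ok
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$label" "$status"
}

# One pass over the routers; does nothing if ROUTER_NIDS is unset.
for nid in ${ROUTER_NIDS:-}; do
    probe "icmp:$nid" ping -c 1 -W 2 "$(nid_ip "$nid")"
    # Assumption: lnetctl ping returns non-zero when the ping fails.
    probe "lnet:$nid" lnetctl ping "$nid"
done
```

appending the output to a file should give a timeline showing the
first pass where the lnet line flips to FAIL while the icmp line is
still ok, which can then be lined up against the neterror messages.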


On Thu, Oct 30, 2025 at 7:34 PM Horn, Chris <chris.horn at hpe.com> wrote:
>
> As a further troubleshooting step, I would suggest enabling neterror in the printk mask on the client and LNet routers:
>
> lctl set_param printk=+neterror
>
> This may surface additional information around the routes going down.
>
> Another thing you ought to try is checking connectivity between client and routers after the routes get marked down. Do pings over the LNet interface work?
>
> ping -I <client_ip> <router_ip>
> lnetctl ping --source <client_nid> <router_nid>
>
> There were only a handful of LNet changes, so it is unlikely to be some regression in LNet.
>
> > git -P le 2.15.7 ^2.15.6 lnet
> 17fc6dbcd6 LU-17784 build: improve wiretest for flexible arrays
> 8535cfe29a LU-18572 lnet: Uninitialized var in lnet_peer_add
> c00bb50624 LU-18697 lnet: lnet_peer_del_nid refcount loss
> 9d8dbed27c LU-16594 build: get_random_u32_below, get_acl with dentry
> 247ae64877 LU-17081 build: compatibility for 6.5 kernels
> >
>
> Chris Horn
>
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Michael DiDomenico via lustre-discuss <lustre-discuss at lists.lustre.org>
> Date: Thursday, October 30, 2025 at 2:08 PM
> To: lustre-discuss <lustre-discuss at lists.lustre.org>
> Subject: [lustre-discuss] client failing off network
>
> our network is running 2.15.6 everywhere on rhel9.5, we recently built
> a new machine using 2.15.7 on rhel9.6 and i'm seeing a strange
> problem.  the client is ethernet connected to ten lnet routers which
> bridge ethernet to infiniband.
>
> i can mount the client just fine, read/write data, but then several
> hours later, the client marks all the routers offline.  the only
> recovery is to lazy unmount, lustre_rmmod, and then restart the lustre
> mount
>
> nothing unusual comes out in the journal/dmesg logs.  to lustre it
> "looks" like someone pulled the network cable, but there's no evidence
> that this has happened physically or even at the switch/software
> layers
>
> we upgraded two other machines to see if the problem replicates, but so
> far it hasn't.  the only significant difference between the three
> machines is the one with the problem has heavy container (podman)
> usage, the others have zero.  i'm not sure if this is a cause or just
> a red herring
>
> any suggestions?