[lustre-discuss] client failing off network

Michael DiDomenico mdidomenico4 at gmail.com
Thu Oct 30 12:05:59 PDT 2025


our network is running 2.15.6 everywhere on rhel9.5, we recently built
a new machine using 2.15.7 on rhel9.6 and i'm seeing a strange
problem.  the client is ethernet connected to ten lnet routers which
bridge ethernet to infiniband.

i can mount the client just fine, read/write data, but then several
hours later, the client marks all the routers offline.  the only
recovery is to lazy unmount, lustre_rmmod, and then restart the lustre
mount
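
for reference, the recovery sequence above can be sketched roughly as below. this is only an illustration, not the exact commands from the affected machine: the mount point /mnt/lustre and the device string 10.0.0.1@tcp:/lustre are hypothetical placeholders, and the helper names recovery_commands and run_recovery are made up for the example. it defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def recovery_commands(mountpoint, device):
    """Return the three recovery steps as argv lists, in order."""
    return [
        # lazy unmount: detach the filesystem even if it is still busy
        ["umount", "-l", mountpoint],
        # unload the lustre and lnet kernel modules
        ["lustre_rmmod"],
        # mount the filesystem again
        ["mount", "-t", "lustre", device, mountpoint],
    ]


def run_recovery(mountpoint, device, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for argv in recovery_commands(mountpoint, device):
        print(shlex.join(argv))
        if not dry_run:
            # stop at the first failing step instead of continuing
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # placeholder values, assumed for illustration only
    run_recovery("/mnt/lustre", "10.0.0.1@tcp:/lustre")
```

calling run_recovery(..., dry_run=False) as root would actually execute the steps, stopping at the first one that fails.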

nothing unusual comes out in the journal/dmesg logs.  to lustre it
"looks" like someone pulled the network cable, but there's no evidence
that this has happened physically or even at the switch/software
layers

we upgraded two other machines to see if the problem replicates, but so
far it hasn't.  the only significant difference between the three
machines is that the one with the problem has heavy container (podman)
usage, while the others have none.  i'm not sure if this is a cause or
just a red herring

any suggestions?

