[lustre-discuss] LNet instability over InfiniBand when running EL9 + ConnectX-3 hardware
Kurt Strosahl
strosahl at jlab.org
Fri Jun 21 08:34:14 PDT 2024
Good Morning,
We've been experiencing a fairly nasty issue with our clients since moving to Alma 9. It occurs seemingly at random (anywhere from a few days to over a week apart): clients with ConnectX-3 cards start logging LNet network errors, see hangs that move between random OSTs spread across our OSS systems, and have trouble talking to the MGS. This can then trigger crash cycles on the OSS systems themselves (again in the LNet layer). The only remedy we have found so far is to power down all of the affected clients and let the affected OSS systems reboot.
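If it helps anyone reproduce this, the kind of low-level LNet checks that can be run from an affected client look roughly like the following (the NIDs are examples from our fabric; lctl and lnetctl come with the standard Lustre client packages):

# can the client still reach the OSS and MGS NIDs over LNet?
lctl ping 172.17.0.97@o2ib
lctl ping 172.17.0.37@o2ib

# local NI state and send/receive/drop counters on the client
lnetctl net show -v
lnetctl stats show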
Here is a snippet of the error as we see it on the client:
[Jun21 08:16] Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection restored to 172.17.0.97@o2ib (at 172.17.0.97@o2ib)
[ +0.000006] Lustre: Skipped 2 previous similar messages
[ +3.079695] Lustre: lustre19-MDT0000-mdc-ffff934c22a29800: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
[ +0.223480] LustreError: 4478:0:(events.c:211:client_bulk_callback()) event type 2, status -5, desc 00000000784c6e4f
[ +0.000007] LustreError: 4478:0:(events.c:211:client_bulk_callback()) Skipped 3 previous similar messages
[ +22.955501] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1718972176/real 1718972176] req@000000008c377199 x1801581392820160/t0(0) o13->lustre24-OST0006-osc-ffff934b8f4a7000@172.17.1.42@o2ib:7/4 lens 224/368 e 0 to 1 dl 1718972183 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'lfs.7953'
[ +0.000006] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[ +20.333921] Lustre: lustre19-OST000a-osc-ffff934c22a29800: Connection restored to 172.17.0.39@o2ib (at 172.17.0.39@o2ib)
[Jun21 08:17] LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS (at 172.17.0.37@o2ib) was lost; in progress operations using this service will fail
[ +0.000302] Lustre: lustre19-OST0046-osc-ffff934c22a29800: Connection to lustre19-OST0046 (at 172.17.0.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ +0.000005] Lustre: Skipped 6 previous similar messages
[ +6.144196] Lustre: MGC172.17.0.36@o2ib: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
[ +0.000006] Lustre: Skipped 1 previous similar message
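The status -5 in client_bulk_callback() looks like -EIO coming back from the bulk transfer. The per-target import/connection state on the client side can be dumped during one of these episodes with something like:

# per-target import / connection state on the client
lctl get_param osc.*.import
lctl get_param mdc.*.import

# save the kernel debug buffer right after an incident, before it wraps
lctl dk /tmp/lustre-debug.log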
We have a mix of client hardware, but the systems are uniform in their kernels and Lustre client versions.
Here are the client software versions:
kernel-modules-core-5.14.0-362.24.1.el9_3.x86_64
kernel-core-5.14.0-362.24.1.el9_3.x86_64
kernel-modules-5.14.0-362.24.1.el9_3.x86_64
kernel-5.14.0-362.24.1.el9_3.x86_64
texlive-l3kernel-20200406-26.el9_2.noarch
kernel-modules-core-5.14.0-362.24.2.el9_3.x86_64
kernel-core-5.14.0-362.24.2.el9_3.x86_64
kernel-modules-5.14.0-362.24.2.el9_3.x86_64
kernel-tools-libs-5.14.0-362.24.2.el9_3.x86_64
kernel-tools-5.14.0-362.24.2.el9_3.x86_64
kernel-5.14.0-362.24.2.el9_3.x86_64
kernel-headers-5.14.0-362.24.2.el9_3.x86_64
and Lustre:
kmod-lustre-client-2.15.4-1.el9.jlab.x86_64
lustre-client-2.15.4-1.el9.jlab.x86_64
Our OSS systems run EL7 with MOFED for their InfiniBand stack and have ConnectX-3 cards (commands to capture the MOFED and firmware details are sketched after the package lists below):
kernel-tools-libs-3.10.0-1160.76.1.el7.x86_64
kernel-tools-3.10.0-1160.76.1.el7.x86_64
kernel-headers-3.10.0-1160.76.1.el7.x86_64
kernel-abi-whitelists-3.10.0-1160.76.1.el7.noarch
kernel-devel-3.10.0-1160.76.1.el7.x86_64
kernel-3.10.0-1160.76.1.el7.x86_64
and the Lustre version:
lustre-2.12.9-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
lustre-resource-agents-2.12.9-1.el7.x86_64
kmod-lustre-2.12.9-1.el7.x86_64
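For completeness, the MOFED release and the firmware on the ConnectX-3 cards on the OSS side can be pulled with the usual tools, roughly:

# MOFED release string
ofed_info -s

# HCA type and firmware version per port
ibstat | grep -iE 'CA type|Firmware'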
w/r,
Kurt J. Strosahl (he/him)
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility