<div dir="ltr">Have you had a close look at the logs from your subnet manager?<div>Assuming you run OpenSM on a server, that is opensm.log.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 21 Jun 2024 at 16:35, Kurt Strosahl via lustre-discuss &lt;<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Good Morning,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
We've been experiencing a fairly nasty issue with our clients following our move to Alma 9. It occurs at random intervals (anywhere from a few days to over a week): the clients with ConnectX-3 cards start getting LNet network errors and seeing moving hangs on random OSTs
spread across our OSS systems, as well as issues talking with the MGS. This can then trigger crash cycles on the OSS systems themselves (again in the LNet layer). The only remedy we have found so far is to power down all the impacted clients and let the
impacted OSS systems reboot.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Here is a snippet of the error as we see it on the client:<br>
[Jun21 08:16] Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection restored to 172.17.0.97@o2ib (at 172.17.0.97@o2ib) </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000006] Lustre: Skipped 2 previous similar messages</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +3.079695] Lustre: lustre19-MDT0000-mdc-ffff934c22a29800: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib) </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.223480] LustreError: 4478:0:(events.c:211:client_bulk_callback()) event type 2, status -5, desc 00000000784c6e4f</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000007] LustreError: 4478:0:(events.c:211:client_bulk_callback()) Skipped 3 previous similar messages</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +22.955501] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1718972176/real 1718972176] req@000000008c377199 x1801581392820160/t0(0) o13->lustre24-OST0006-osc-ffff934b8f4a7000@172.17.1.42@o2ib:7/4
lens 224/368 e 0 to 1 dl 1718972183 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'lfs.7953'</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000006] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) Skipped 21 previous similar messages</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +20.333921] Lustre: lustre19-OST000a-osc-ffff934c22a29800: Connection restored to 172.17.0.39@o2ib (at 172.17.0.39@o2ib) </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[Jun21 08:17] LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS (at 172.17.0.37@o2ib) was lost; in progress operations using this service will fail </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000302] Lustre: lustre19-OST0046-osc-ffff934c22a29800: Connection to lustre19-OST0046 (at 172.17.0.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000005] Lustre: Skipped 6 previous similar messages</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +6.144196] Lustre: MGC172.17.0.36@o2ib: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
[ +0.000006] Lustre: Skipped 1 previous similar message</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
We have a mix of client hardware, but the systems are uniform in their kernels and lustre clients.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Here are the software versions:<br>
kernel-modules-core-5.14.0-362.24.1.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-core-5.14.0-362.24.1.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-modules-5.14.0-362.24.1.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-5.14.0-362.24.1.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
texlive-l3kernel-20200406-26.el9_2.noarch</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-modules-core-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-core-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-modules-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-tools-libs-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-tools-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-headers-5.14.0-362.24.2.el9_3.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
and lustre:</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kmod-lustre-client-2.15.4-1.el9.jlab.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
lustre-client-2.15.4-1.el9.jlab.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Our OSS systems run el7, use MOFED for their InfiniBand stack, and have ConnectX-3 cards:</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-tools-libs-3.10.0-1160.76.1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-tools-3.10.0-1160.76.1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-headers-3.10.0-1160.76.1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-abi-whitelists-3.10.0-1160.76.1.el7.noarch</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-devel-3.10.0-1160.76.1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kernel-3.10.0-1160.76.1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
and Lustre version:</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
lustre-2.12.9-1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
lustre-osd-zfs-mount-2.12.9-1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
lustre-resource-agents-2.12.9-1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
kmod-lustre-2.12.9-1.el7.x86_64</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
w/r,</div>
<div id="m_-6610091034798461561m_-2186441775380847908Signature" style="color:inherit">
<div style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr" id="m_-6610091034798461561m_-2186441775380847908divtagdefaultwrapper">
<p style="margin-top:0px;margin-bottom:0px"><span style="font-family:monospace;font-size:14.16px;color:rgb(51,51,51)">Kurt J. Strosahl (he/him)</span><br>
<span style="font-family:monospace;font-size:14.16px;color:rgb(51,51,51)">System Administrator: Lustre, HPC</span><br>
<span style="font-family:monospace;font-size:14.16px;color:rgb(51,51,51)">Scientific Computing Group, Thomas Jefferson National Accelerator Facility</span><br>
</p>
</div>
</div>
</div>
</div></blockquote></div>
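<div>When lining the client-side errors up against the subnet manager log, it may help to first see which server NIDs keep dropping and reconnecting. The following is only a rough sketch, not a tested tool: the function name <code>tally_events</code> and both regular expressions are my own assumptions, derived solely from the message formats quoted above, and other Lustre versions may word these messages differently.</div>

```python
import re
from collections import Counter

# Patterns are assumptions based only on the client messages quoted in this thread.
RESTORED = re.compile(r"Connection restored to (\S+@o2ib\d*)")
LOST = re.compile(r"Connection to .*?\(at (\S+@o2ib\d*)\) was lost")


def tally_events(lines):
    """Count 'lost' and 'restored' connection events per server NID."""
    counts = {"lost": Counter(), "restored": Counter()}
    for line in lines:
        m = LOST.search(line)
        if m:
            counts["lost"][m.group(1)] += 1
            continue
        m = RESTORED.search(line)
        if m:
            counts["restored"][m.group(1)] += 1
    return counts


# Sample lines copied from the quoted client log (timestamps stripped).
sample = [
    "Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection restored to "
    "172.17.0.97@o2ib (at 172.17.0.97@o2ib)",
    "LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS "
    "(at 172.17.0.37@o2ib) was lost; in progress operations using this service will fail",
    "Lustre: MGC172.17.0.36@o2ib: Connection restored to "
    "172.17.0.37@o2ib (at 172.17.0.37@o2ib)",
]

counts = tally_events(sample)
for kind, per_nid in counts.items():
    for nid, n in sorted(per_nid.items()):
        print(f"{kind:9s} {nid} {n}")
```

<div>Feeding it the full <code>dmesg</code> output from an affected client (one line per list element) should show whether the drops cluster on a few servers or are spread evenly, which is worth knowing before digging through opensm.log for the same time window.</div>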