<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Hi, <br>
<br>
We just had a similar issue on 2.15.5: InfiniBand clients were not
reconnecting after a target outage.<br>
<br>
Deleting the LNet net and importing the config again solved it
without a reboot or unmount:<br>
<br>
# lnetctl net del --net o2ib<br>
# lnetctl import &lt; /etc/lnet.conf<br>
<br>
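A minimal dry-run sketch of that sequence, assuming the net is named
o2ib and the config lives in /etc/lnet.conf. The helper name
lnet_reset_plan is made up for illustration, and the two "net show"
steps are an added sanity check, not part of what we ran. It only
prints the commands so they can be reviewed before running them by
hand:<br>
<br>
<pre>
```shell
#!/bin/sh
# Dry-run sketch: print the LNet reset sequence for review; nothing is executed.
# lnet_reset_plan is a hypothetical helper name, not part of Lustre.
lnet_reset_plan() {
    net="${1:-o2ib}"              # LNet network to drop (assumed default)
    conf="${2:-/etc/lnet.conf}"   # YAML config to re-import (path from the post)
    printf '%s\n' \
        "lnetctl net show --net $net" \
        "lnetctl net del --net $net" \
        "lnetctl import $conf" \
        "lnetctl net show --net $net"
}

# With no arguments this prints the plan for o2ib and /etc/lnet.conf.
lnet_reset_plan "$@"
```
</pre>
Passing a different net name and config path as the two arguments
prints the same steps for that net instead.<br>
<br>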
Cheers,<br>
Hans Henrik<br>
<br>
<div class="moz-cite-prefix">On 28/08/2024 18.18, Lixin Liu via
lustre-discuss wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6DAF041D-5153-461D-8272-826E37A5A13C@sfu.ca">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator"
content="Microsoft Word 15 (filtered medium)">
<style>@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
{font-family:Aptos;
panose-1:2 11 0 4 2 2 2 2 2 4;}@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Aptos",sans-serif;}a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0cm;
font-size:10.0pt;
font-family:"Courier New";}p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
font-size:11.0pt;
font-family:"Aptos",sans-serif;}span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:Consolas;}span.EmailStyle24
{mso-style-type:personal-reply;
font-family:"Aptos",sans-serif;
color:windowtext;}.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}div.WordSection1
{page:WordSection1;}ol
{margin-bottom:0cm;}ul
{margin-bottom:0cm;}</style>
<div class="WordSection1">
<p class="MsoNormal">We had the same problem after we upgraded
Lustre servers from 2.12.8 to 2.15.3.
<o:p></o:p></p>
<p class="MsoNormal">Clients were running 2.15.3 on CentOS 7.
Random OSTs dropped out frequently on<o:p></o:p></p>
<p class="MsoNormal">busy login nodes (almost daily), but less
so on compute nodes. The “lctl” command<o:p></o:p></p>
<p class="MsoNormal">could not activate the OSTs, and a reboot was the only
way to clear the problem.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">In June, we upgraded the OS on all clients to
AlmaLinux 9.3 and Lustre to version 2.15.4 on<o:p></o:p></p>
<p class="MsoNormal">both servers and clients (we missed the 2.15.5
release by about 2 weeks). Since the upgrade,<o:p></o:p></p>
<p class="MsoNormal">we have not seen this problem.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">In our case, I wonder whether this was Omni-Path
related. The servers on AlmaLinux 8 were using the<o:p></o:p></p>
<p class="MsoNormal">in-kernel driver, but the CentOS 7 clients were
using the driver from the Intel/Cornelis release.<o:p></o:p></p>
<p class="MsoNormal">The Alma 9 clients are now also using the in-kernel
driver.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Cheers,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Lixin.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div
style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span
style="font-family:&quot;Calibri&quot;,sans-serif;color:black">From:
</span></b><span
style="font-family:&quot;Calibri&quot;,sans-serif;color:black">lustre-discuss
<a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss-bounces@lists.lustre.org">&lt;lustre-discuss-bounces@lists.lustre.org&gt;</a> on behalf
of Cameron Harr via lustre-discuss
<a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org">&lt;lustre-discuss@lists.lustre.org&gt;</a><br>
<b>Reply-To: </b>Cameron Harr <a class="moz-txt-link-rfc2396E" href="mailto:harr1@llnl.gov">&lt;harr1@llnl.gov&gt;</a><br>
<b>Date: </b>Wednesday, August 28, 2024 at 8:19 AM<br>
<b>To: </b><a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org">"lustre-discuss@lists.lustre.org"</a>
<a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org">&lt;lustre-discuss@lists.lustre.org&gt;</a><br>
<b>Subject: </b>Re: [lustre-discuss] How to activate an
OST on a client ?</span><span
style="font-size:12.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p>There's also an "lctl --device &lt;dev&gt; activate" that
I've used in the past, though I don't know what conditions need
to be met for it to work.<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/27/24 07:46, Andreas Dilger via
lustre-discuss wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi Jan, <o:p></o:p></p>
<div>
<p class="MsoNormal">There is "lctl --device XXXX recover",
which will trigger a reconnect to the named OST device (per
the "lctl dl" output), but I'm not sure if that will help. <o:p></o:p></p>
</div>
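<div>
<p class="MsoNormal">A rough sketch of how the device numbers could be
picked out of "lctl dl" output, assuming the usual column layout (index,
state, type, name, uuid, refcount) and that an inactive import shows a
state other than UP, such as IN. The helper name osc_recover_cmds and the
sample lines are made up for illustration. It only prints the "recover"
commands and does not run them:<o:p></o:p></p>
</div>
<pre>
```shell
#!/bin/sh
# Sketch: turn "lctl dl"-style text on stdin into "recover" commands for
# every osc device whose state column is not UP. Nothing is executed.
# osc_recover_cmds is a hypothetical helper name, not part of Lustre.
osc_recover_cmds() {
    awk '$3 == "osc" { if ($2 != "UP") print "lctl --device " $1 " recover" }'
}

# Made-up sample in the assumed layout: index, state, type, name, uuid, refcount.
sample='  0 UP mgc MGC10.0.0.1@tcp 11111111-aaaa 4
  1 UP osc demo-OST0000-osc-ffff0001 22222222-bbbb 4
  2 IN osc demo-OST0001-osc-ffff0001 22222222-bbbb 4'

# Prints: lctl --device 2 recover
printf '%s\n' "$sample" | osc_recover_cmds
```
</pre>
<div>
<p class="MsoNormal">On a real client the input would come from
"lctl dl | osc_recover_cmds", and the printed commands can be checked
before running any of them.<o:p></o:p></p>
</div>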
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Cheers, Andreas<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal" style="margin-bottom:12.0pt">On Aug
22, 2024, at 06:36, Haarst, Jan van via lustre-discuss
<a href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true">&lt;lustre-discuss@lists.lustre.org&gt;</a>
wrote:<o:p></o:p></p>
</blockquote>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal"><span lang="NL">Hi, </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="NL"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">The wording of the
subject probably doesn’t quite cover the
issue. What we see is this:</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">We have a client
behind a router (linking TCP to Omni-Path) that shows
an inactive OST (everything is on 2.15.5).</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">Other clients
that go through the router do not have this issue.
</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">One client had
the same issue, although it showed a different OST
as inactive.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">After a reboot,
all was well again on that machine.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">The clients can
lctl ping the OSSs.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">So although we
have a workaround (reboot the client), it would be
nice to:</span><o:p></o:p></p>
<ol style="margin-top:0cm" type="1" start="1">
<li class="MsoListParagraph"
style="margin-left:0cm;mso-list:l0 level1 lfo3"><span
lang="EN-US">Fix the issue without a reboot</span><o:p></o:p></li>
<li class="MsoListParagraph"
style="margin-left:0cm;mso-list:l0 level1 lfo3"><span
lang="EN-US">Fix the underlying issue.</span><o:p></o:p></li>
</ol>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">It might be
unrelated, but we also see another routing issue
every now and then:</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">The router stops
routing requests toward a certain OSS, and this can
be fixed by deleting the peer_nid of the OSS from
the router.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">I am probably
missing informative logs, but I’m more than happy to
try to generate them, if somebody has a pointer to
how.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">We are a bit
stumped right now.</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">With kind
regards,</span><o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> </span><o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB">-- </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB">Jan
van Haarst</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">HPC
</span><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB">Administrator</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB">For
Anunna/HPC questions, please use <a
href="https://urldefense.us/v3/__https:/support.wur.nl__;!!G2kpM7uM-TzIFchu!1YPSOGUFPvipdg8HUxDkmcB7rvfUxuSATnKZq-9LFTP16TrMxtlrPe7m3ccX4BmKFoLsVnaKiIL3u4pxK2GT6mMjyuAoAg$"
moz-do-not-send="true"><span
style="color:#0563C1">https://support.wur.nl</span></a> (with
HPC as service)</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">Present: Monday, Tuesday, Thursday
&amp; Friday </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">Facilitair Bedrijf, part of
Wageningen University &amp; Research </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">Information Technology Department </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">Postbus 59, 6700 AB, Wageningen </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"
lang="NL">Gebouw 116, Akkermaalsbos 12, 6700 WB,
Wageningen </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:9.0pt;font-family:Consolas;color:black;border:none windowtext 1.0pt;padding:0cm;mso-fareast-language:EN-GB"><a
href="https://urldefense.us/v3/__http:/www.wur.nl/nl/Disclaimer.htm__;!!G2kpM7uM-TzIFchu!1YPSOGUFPvipdg8HUxDkmcB7rvfUxuSATnKZq-9LFTP16TrMxtlrPe7m3ccX4BmKFoLsVnaKiIL3u4pxK2GT6mP2LXgG1Q$"
title="http://www.wur.nl/nl/Disclaimer.htm"
moz-do-not-send="true"><span
style="color:#0563C1">http://www.wur.nl/nl/Disclaimer.htm</span></a></span><o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:12.0pt">_______________________________________________<br>
lustre-discuss mailing list<br>
<a href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true"
class="moz-txt-link-freetext">lustre-discuss@lists.lustre.org</a><br>
<a
href="https://urldefense.us/v3/__http:/lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!G2kpM7uM-TzIFchu!1YPSOGUFPvipdg8HUxDkmcB7rvfUxuSATnKZq-9LFTP16TrMxtlrPe7m3ccX4BmKFoLsVnaKiIL3u4pxK2GT6mNJQIy33g$"
moz-do-not-send="true">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><o:p></o:p></span></p>
</div>
</blockquote>
</div>
<p class="MsoNormal"><span style="font-size:12.0pt"><br>
<br>
<o:p></o:p></span></p>
</blockquote>
</div>
<br>
</blockquote>
<br>
</body>
</html>