<div dir="ltr">Hi Colin,<div><br></div><div>I have attached a small drawing that represents the setup.<br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Nov 2021 at 22:49, Colin Faber <<a href="mailto:cfaber@gmail.com">cfaber@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Koos,</div><div><br></div><div>One thing you mentioned that I should have picked up on sooner was "The servers are connected in a multirail network, because some clients are in IB and the other clients are on ethernet"</div><div><br></div><div>Can you describe your topology? How are the various elements connected to each other?</div><div><br></div><div>-cf</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 19, 2021 at 5:38 AM Meijering, Koos <<a href="mailto:h.meijering@rug.nl" target="_blank">h.meijering@rug.nl</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">One more addition: I also see the following message on the OSS that had the OST before the failover:<br>Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID: not available for connect from 172.23.53.214@o2ib4 (no target). 
If you are running an HA pair check that the target is mounted on the other server.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <<a href="mailto:h.meijering@rug.nl" target="_blank">h.meijering@rug.nl</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Colin,<br><br>I've attached three log files: one from the metadata server and two from the object storage servers.<br>Before these logs start, the filesystem was working; I then asked the cluster to fail over muse-OST0001 from oss01 to oss02.<br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 18 Nov 2021 at 17:11, Colin Faber <<a href="mailto:cfaber@gmail.com" target="_blank">cfaber@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Koos,</div><div><br></div><div>First thing -- it's generally a bad idea to run newer server versions with older clients (the opposite isn't true).</div><div><br></div><div> Second -- do you have any logging that you can share from the client itself? 
(dmesg, syslog, etc.)<br></div><div><br></div><div>A quick test may be to run 2.12.7 clients against your cluster to verify that there is no interop problem.<br></div><div><br></div><div>-cf</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi all,<div><br></div><div>We have built a Lustre server environment on CentOS 7 with Lustre 2.12.7.<br>The clients are running 2.12.5.<br>The setup consists of 3 clusters for a 3 PB filesystem.<br>One cluster is a two-node cluster built for the MGS and MDTs.<br>The other two clusters are also two-node clusters, used for the OSTs.</div><div>The cluster framework is working as expected.<br><br>The servers are connected in a multirail network, because some clients are in IB and the other clients are on ethernet<br><br>But we have the following problem. 
When an OST fails over to the second node, the clients are unable to contact the OST once it is started on the other node.<br>The OST recovery status is then "waiting for clients".</div><div>When we fail it back, it starts working again and the recovery status is complete.</div><div><br>We tried to abort the recovery, but that does not work.</div><div><br></div><div>We used these documents to build the cluster:</div><div><a href="https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)" target="_blank">https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)</a><br><a href="https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)" target="_blank">https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)</a><br><a href="https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)" target="_blank">https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)</a><br><a href="https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services" target="_blank">https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services</a><br></div><div><br></div><div>I'm not sure what the next steps should be to find the problem, or where to look.</div><div><br></div><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Best regards</div><div>Koos Meijering</div><div>........................................................................</div><div>HPC Team<br></div><div></div><div>Rijksuniversiteit Groningen<br></div><div>........................................................................</div></div></div></div></div></div>

_______________________________________________<br>

lustre-discuss mailing list<br>

<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a><br>

<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>