<div dir="ltr">Hi Laura, thanks for your reply.<div>It seems the OSSs in each pair share disks provisioned from a shared SAN, so an OSS pair can fail over in a predefined manner if one node goes down, coordinated by an HA manager.</div><div><br></div><div>This can certainly work on a limited scale. I'm curious whether this static scheme can scale to a large cluster with hundreds of OSSs?</div><div><br></div><div><br></div><div>regards,</div><div>Shawn</div><div><br></div><div><br></div><div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 18, 2023 at 1:25 PM Laura Hild &lt;<a href="mailto:lsh@jlab.org">lsh@jlab.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I'm not familiar with using FLR to tolerate OSS failures.  My site does the HA pairs with shared storage method.  It's sort of described in the manual<br>

<br>

  <a href="https://doc.lustre.org/lustre_manual.xhtml#configuringfailover" rel="noreferrer" target="_blank">https://doc.lustre.org/lustre_manual.xhtml#configuringfailover</a><br>

<br>

but in more, Pacemaker-specific detail at<br>

<br>

  <a href="https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker" rel="noreferrer" target="_blank">https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker</a><br>

<br>

and<br>

<br>

  <a href="https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services" rel="noreferrer" target="_blank">https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services</a><br>

<br>

</blockquote></div>