[lustre-discuss] how does lustre handle node failure

Fri Jul 21 15:04:32 PDT 2023

Hi Laura, thanks for your reply.
It seems the OSSs in each pair share disks provisioned from a shared SAN, so
the pair can fail over in a predefined manner if one node goes down,
coordinated by an HA manager.
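
To check my understanding, here is a toy model of that pairing. It is only a
sketch: the node names, OST names, and the place_osts() helper are made up
for illustration and are not part of Lustre or Pacemaker.

```python
# Toy model of static HA pairs; all names here are hypothetical.
# Each OST lists the two OSS nodes that can mount it, in preference order:
# (primary, failover partner).
OST_SERVERS = {
    "OST0000": ("oss01", "oss02"),
    "OST0001": ("oss02", "oss01"),
    "OST0002": ("oss03", "oss04"),
    "OST0003": ("oss04", "oss03"),
}


def place_osts(ost_servers, failed):
    """Return {ost: serving node, or None if both nodes of its pair are down}."""
    placement = {}
    for ost, candidates in ost_servers.items():
        # Pick the first healthy node in preference order.
        placement[ost] = next((n for n in candidates if n not in failed), None)
    return placement


# One node down: its partner takes over both OSTs of the pair.
print(place_osts(OST_SERVERS, failed={"oss01"}))
# Both nodes of a pair down: that pair's OSTs have nowhere to go.
print(place_osts(OST_SERVERS, failed={"oss03", "oss04"}))
```

In this model each OST's placement depends only on its own pair, so losing
both nodes of one pair takes that pair's OSTs offline while the other pairs
are unaffected.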

This can certainly work on a limited scale.  I'm curious whether this static
scheme can scale to a large cluster with hundreds of OSSs?

regards,
Shawn

On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <lsh at jlab.org> wrote:

> I'm not familiar with using FLR to tolerate OSS failures.  My site does
> the HA pairs with shared storage method.  It's sort of described in the
> manual
>
>   https://doc.lustre.org/lustre_manual.xhtml#configuringfailover
>
> but in more, Pacemaker-specific detail at
>
>   https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker
>
> and
>
>   https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>