[lustre-discuss] how does lustre handle node failure

Andreas Dilger adilger at whamcloud.com
Sat Jul 22 21:35:17 PDT 2023


Shawn,
Lustre handles the largest filesystems in the world, hundreds of PB in size, so there are definitely Lustre filesystems with hundreds of servers.

In large storage clusters the servers fail over in pairs or quads, since the storage is typically not on a single global SAN that every node can access. There is therefore no single huge HA cluster spanning all of the servers in the filesystem.
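
To illustrate why this scales, here is a toy Python model of independent failover groups. It is only a sketch and not Lustre or Pacemaker code: the FailoverGroup class, the build_cluster helper, the server and OST names, and the counts (200 servers, 2 OSTs each) are all assumptions made up for the example. The point it shows is that a failure is handled entirely inside one small group, so the amount of HA coordination does not grow with the number of servers.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    """One small, independent HA group (a pair or quad) with its own shared storage."""
    servers: list[str]
    # Hypothetical mapping: OST name -> server currently serving it.
    targets: dict[str, str] = field(default_factory=dict)
    down: set[str] = field(default_factory=set)

    def fail(self, server: str) -> None:
        """Mark a server as failed and move its targets to surviving peers in this group."""
        self.down.add(server)
        survivors = [s for s in self.servers if s not in self.down]
        if not survivors:
            raise RuntimeError("all servers in this group are down")
        orphaned = sorted(t for t, s in self.targets.items() if s == server)
        # Spread the orphaned targets round-robin over the surviving peers.
        for i, target in enumerate(orphaned):
            self.targets[target] = survivors[i % len(survivors)]


def build_cluster(num_servers: int, group_size: int = 2,
                  osts_per_server: int = 2) -> list[FailoverGroup]:
    """Split the servers into fixed groups; each group only knows its own targets."""
    groups = []
    ost = 0
    for start in range(0, num_servers, group_size):
        end = min(start + group_size, num_servers)
        servers = [f"oss{n:03d}" for n in range(start, end)]
        group = FailoverGroup(servers=servers)
        for s in servers:
            for _ in range(osts_per_server):
                group.targets[f"OST{ost:04x}"] = s
                ost += 1
        groups.append(group)
    return groups


# 200 servers become 100 independent pairs.
cluster = build_cluster(200)
before = [dict(g.targets) for g in cluster]

# Losing one server only changes the target mapping inside its own pair.
cluster[0].fail("oss000")
```

In this model, growing from 2 servers to 200 adds more pairs but does not make any single HA group larger, which matches the pairs-or-quads layout described above.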

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:


Hi Laura, thanks for your reply.
It seems the OSSs in a pair share disks provisioned from a shared SAN, so each OSS pair can fail over in a pre-defined manner when one node goes down, coordinated by an HA manager.

This can certainly work on a limited scale.  I'm curious whether this static scheme can scale to a large cluster with hundreds of OSS servers?


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <lsh at jlab.org> wrote:
I'm not familiar with using FLR to tolerate OSS failures.  My site uses the HA-pairs-with-shared-storage method.  It's sort of described in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more, Pacemaker-specific detail at

  https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
