[lustre-discuss] BCP for High Availability?

Thu Jan 19 12:45:24 PST 2023

We (LLNL) were probably the lab you saw using pacemaker-remote, and we
still are, as it generally works and is what we're used to. That said,
on an upcoming system we may end up trying 2-node HA clusters due to the
vendor's preference. I'm not sure what specifics you're interested in,
but as you mention, the pacemaker-remote option lets one cluster bring
the entire file system up or down, and it can handle fencing and
resource management for every server. The biggest caveat with this
method (learned the hard way by numerous folks) is not to run
'systemctl stop pacemaker' on that central node unless you really want
to take down the entire file system.
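
As a minimal sketch of a guard against that mistake: the wrapper below
refuses to stop pacemaker on the central node unless an explicit
confirmation flag is passed. The node name "mgmt1", the function name,
the flag, and the RUN variable are all hypothetical, chosen only for
illustration; the only real command involved is the 'systemctl stop
pacemaker' mentioned above.

```shell
#!/bin/sh
# Hypothetical short hostname of the central Pacemaker node; adjust per site.
CENTRAL_NODE="${CENTRAL_NODE:-mgmt1}"

# safe_stop_pacemaker HOST [--yes-stop-whole-filesystem]
# By default only prints the command it would run; set RUN=1 to execute it.
safe_stop_pacemaker() {
    host="$1"
    confirm="${2:-}"

    # On the central node, require the explicit (hypothetical) confirmation flag.
    if [ "$host" = "$CENTRAL_NODE" ] && [ "$confirm" != "--yes-stop-whole-filesystem" ]; then
        echo "refusing: $host is the central node; stopping pacemaker here takes down the entire file system" >&2
        return 1
    fi

    if [ "${RUN:-0}" = "1" ]; then
        systemctl stop pacemaker
    else
        echo "systemctl stop pacemaker"
    fi
}
```

One possible use is to source this in the admin shell profile and call
it as: safe_stop_pacemaker "$(hostname -s)". This only guards against
accidents at the shell; it does not change how Pacemaker itself behaves.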

On 1/15/23 18:37, Andrew Elwell via lustre-discuss wrote:
> Hi Folks,
>
> I'm just rebuilding my testbed and have got to the "sort out all the
> pacemaker stuff" part. What's the best current practice for the
> current LTS (2.15.x) release tree?
>
> I've always done this as multiple individual HA clusters covering each
> pair of servers with common dual connected drive array(s), but I
> remember seeing a talk some years ago where one of the US labs was
> using "pacemaker-remote" and bringing them all up from a central node
>
> I note there's a few (old) crib notes on the wiki - referenced from
> the lustre manual, but nothing updated in the last couple of years.
>
> What are people out there doing?
>
>
> Many thanks
>
> Andrew
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org