[Lustre-discuss] Large Corosync/Pacemaker clusters

Hall, Shawn Shawn.Hall at bp.com
Tue Oct 30 18:43:09 PDT 2012


Thanks for the replies.  We've worked on the HA setup and have it at a
satisfactory point where we can put it into production.  We broke it
into an MDS pair and 4 groups of 4 OSS nodes.  From our perspective,
groups of 4 are actually easier to manage than groups of 2, since there
are half as many configurations to keep track of.
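For anyone setting up something similar, here is a rough sketch of what
one OST resource in one of those 4-node OSS groups can look like in crm
shell syntax.  The device path, mount point, node name, and score are
made up for illustration and are not copied from our production config:

    # One OST, mounted via the standard Filesystem resource agent.
    primitive resOST0000 ocf:heartbeat:Filesystem \
        params device="/dev/mapper/ost0000" directory="/lustre/ost0000" \
               fstype="lustre" \
        op monitor interval="120s" timeout="60s"

    # Prefer the OST's "home" OSS; failover to the other three nodes
    # in the group is still allowed.
    location locOST0000 resOST0000 100: oss01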

After splitting the cluster into 5 pieces, it has become much more
responsive and stable.  Five clusters are more work to manage than one
large cluster, but the stability is obviously worth it.  We've been
performing heavy load testing and have not been able to "break" the
cluster.  We did a few more things to get to this point:

- Lowered the nice value of the corosync process so it stays responsive
under load, preventing nodes from getting kicked out of the cluster due
to unresponsiveness (see the example below the list).
- Increased vm.min_free_kbytes to give TCP/IP with jumbo frames room to
move around (example below).  Without this, certain nodes would hit
low-memory issues related to networking and would get STONITHed due to
unresponsiveness.
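For reference, the priority change amounts to something like the
following.  The -10 value is only an illustration, not necessarily what
we run in production:

    # Raise corosync's scheduling priority by lowering its nice value.
    # -10 is an example value; tune based on your own testing.
    renice -n -10 -p $(pidof corosync)

We also make sure this gets re-applied whenever corosync is restarted;
how you do that depends on your init scripts or distro.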
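And the sysctl change; again, the 262144 KB (256 MB) figure is just an
example, not a recommendation:

    # Keep more memory free for atomic allocations so jumbo-frame
    # network buffers can always be satisfied.  262144 KB is an example.
    sysctl -w vm.min_free_kbytes=262144

    # Persist the setting across reboots.
    echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf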

Thanks,
Shawn

-----Original Message-----
From: Charles Taylor [mailto:taylor at hpc.ufl.edu] 
Sent: Wednesday, October 24, 2012 3:33 PM
To: Hall, Shawn
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Large Corosync/Pacemaker clusters


FWIW, we are running HA Lustre using corosync/pacemaker.  We broke our
OSSs and MDSs out into individual HA *pairs*.  We thought about other
configurations, but this was our first step into corosync/pacemaker, so
we decided to keep it as simple as possible.  It seems to work well.
I'm not sure I would attempt what you are doing, though it may be
perfectly fine.  When HA is a requirement, it probably makes sense to
avoid pushing the limits of what works.

That doesn't really help you much other than providing a data point on
what other sites are doing.

Good luck and report back.   

Charlie Taylor
UF HPC Center

On Oct 19, 2012, at 12:52 PM, Hall, Shawn wrote:

> Hi,
>  
> We're setting up fairly large Lustre 2.1.2 filesystems, each with 18
> nodes and 159 resources all in one Corosync/Pacemaker cluster, as
> suggested by our vendor.  We're getting mixed messages from our vendor
> and others on how large a Corosync/Pacemaker cluster will work well.
>  
> 1.       Are there Lustre Corosync/Pacemaker clusters out there of
> this size or larger?
> 2.       If so, what tuning needed to be done to get it to work well?
> 3.       Should we be looking more seriously into splitting this
> Corosync/Pacemaker cluster into pairs or sets of 4 nodes?
>  
> Right now, our configuration takes a long time to start/stop all
> resources (~30-45 mins), and failing back OSTs puts a heavy load on
> the cib process on every node in the cluster.  Under heavy IO load,
> many of the nodes will show as "unclean/offline" and many OST
> resources will show as inactive in crm status, despite the fact that
> every single MDT and OST is still mounted in the appropriate place.
> We are running 2 corosync rings, each on a private 1 GbE network.  We
> have a bonded 10 GbE network for LNET.
>  
> Thanks,
> Shawn
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
