[Lustre-discuss] Large Corosync/Pacemaker clusters
marco.passerini at csc.fi
Tue Nov 6 05:12:58 PST 2012
I'm also setting up a highly available Lustre system. I configured
failover pairs for the OSSes and MDSes, redundant Corosync rings (two
separate rings: IB and Eth), and Stonith is enabled.
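For reference, a two-ring setup in /etc/corosync/corosync.conf looks
roughly like this (addresses are placeholders; ring 0 over IPoIB, ring 1
over Ethernet):

  totem {
          version: 2
          # "passive" uses one ring at a time and fails over;
          # "active" sends on both rings simultaneously
          rrp_mode: passive
          interface {
                  ringnumber: 0
                  bindnetaddr: 10.10.0.0      # IPoIB network (example)
                  mcastaddr: 239.255.1.1
                  mcastport: 5405
          }
          interface {
                  ringnumber: 1
                  bindnetaddr: 10.20.0.0      # Ethernet network (example)
                  mcastaddr: 239.255.2.1
                  mcastport: 5405
          }
  }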
The current configuration seems to work fine, however yesterday we
experienced a problem: 4 OSSes got rebooted by Stonith. I suspect that
Corosync missed a heartbeat due to a kernel/Corosync hang rather than a
network problem. I will try the "renice" solution you suggested.
I have been thinking that I could increase the "token" timeout value in
/etc/corosync/corosync.conf to prevent short "hiccups". Did you specify
a value for this parameter, or did you leave the default 1000 ms?
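Something like this is what I have in mind, in the totem section of
corosync.conf (5000 ms is just an example value, not a recommendation):

  totem {
          # time (ms) to wait for the token before declaring it lost;
          # the corosync default is 1000
          token: 5000
          # retransmit attempts before the token is considered lost
          token_retransmits_before_loss_const: 10
  }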
On 2012-10-31 03:43, Hall, Shawn wrote:
> Thanks for the replies. We've worked on the HA and have it at a
> satisfactory point where we can put it into production. We broke it
> into an MDS pair and 4 groups of 4 OSS nodes. From our perspective, it's
> actually easier to manage groups of 4 than groups of 2, since it's half
> as many configurations to keep track of.
> After splitting the cluster into 5 pieces it has become much more
> responsive and stable. It's more difficult to manage than one large
> cluster, but the stability is obviously worth it. We've been performing
> heavy load testing and have not been able to "break" the cluster. We
> did a few more things to get to this point:
> - Lowered the nice value of the corosync process to make it more
> responsive under load and prevent a node from getting kicked out due to
> missed heartbeats.
> - Increased vm.min_free_kbytes to give TCP/IP w/ jumbo frames room to
> move around. Without this, certain nodes would have low-memory issues
> related to networking and would get stonithed due to unresponsiveness
> (both tunings are sketched below).
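> Roughly, with placeholder values only (not necessarily what we run):
>
>   # run corosync at a higher scheduling priority (lower nice value);
>   # -10 is an example value
>   renice -n -10 -p $(pidof corosync)
>
>   # reserve extra free memory so jumbo-frame allocations don't stall;
>   # 262144 kB is an example - size it to the node's RAM
>   sysctl -w vm.min_free_kbytes=262144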
> -----Original Message-----
> From: Charles Taylor [mailto:taylor at hpc.ufl.edu]
> Sent: Wednesday, October 24, 2012 3:33 PM
> To: Hall, Shawn
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Large Corosync/Pacemaker clusters
> FWIW, we are running HA Lustre using corosync/pacemaker. We broke our
> OSSs and MDSs out into individual HA *pairs*. Thought about other
> configurations but it was our first step into corosync/pacemaker so we
> decided to keep it as simple as possible. Seems to work well. I'm
> not sure I would attempt what you are doing though it may be perfectly
> fine. When HA is a requirement, it probably makes sense to avoid
> pushing the limits of what works.
> Doesn't really help you much other than to provide a data point with
> regard to what other sites are doing.
> Good luck and report back.
> Charlie Taylor
> UF HPC Center
> On Oct 19, 2012, at 12:52 PM, Hall, Shawn wrote:
>> We're setting up fairly large Lustre 2.1.2 filesystems, each with 18
> nodes and 159 resources all in one Corosync/Pacemaker cluster as
> suggested by our vendor. We're getting mixed messages from our vendor
> and others on how large a Corosync/Pacemaker cluster will work well.
>> 1. Are there Lustre Corosync/Pacemaker clusters out there of
> this size or larger?
>> 2. If so, what tuning needed to be done to get it to work well?
>> 3. Should we be looking more seriously into splitting this
> Corosync/Pacemaker cluster into pairs or sets of 4 nodes?
>> Right now, our current configuration takes a long time to start/stop
> all resources (~30-45 mins), and failing back OSTs puts a heavy load on
> the cib process on every node in the cluster. Under heavy IO load, many
> of the nodes will show as "unclean/offline" and many OST resources
> will show as inactive in crm status, despite the fact that every single
> MDT and OST is still mounted in the appropriate place. We are running 2
> corosync rings, each on a private 1 GbE network. We have a bonded 10
> GbE network for the LNET.