[Lustre-discuss] Large Corosync/Pacemaker clusters

Tue Nov 6 05:12:58 PST 2012

Hi,

I'm also setting up a highly available Lustre system. I configured pairs 
for the OSSes and MDSes, with redundant Corosync rings (two separate 
rings: IB and Eth), and Stonith is enabled.

The current configuration seems to work fine. However, yesterday we 
experienced a problem: 4 OSSes were rebooted by Stonith. I suspect that 
Corosync missed a heartbeat because of a kernel/corosync hang rather 
than a network problem. I will try the "renice" solution you proposed.

I have been thinking that I could increase the "token" timeout value in 
/etc/corosync/corosync.conf to ride out short "hiccups". Did you 
specify a value for this parameter, or did you leave the default 1000 ms 
value?
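
For reference, the change I have in mind would look roughly like the 
sketch below. The 5000 ms value is only an illustrative assumption, not 
a tested recommendation, and the other totem options (interfaces, 
rrp_mode, etc.) are left out:

```
totem {
    version: 2
    # Time in ms to wait for the token before declaring it lost.
    # 5000 is a placeholder value for illustration only.
    token: 5000
}
```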

Marco

On 2012-10-31 03:43, Hall, Shawn wrote:
> Thanks for the replies.  We've worked on the HA and have it to a
> satisfactory point where we can put it into production.  We broke it
> into a MDS pair and 4 groups of 4 OSS nodes.  From our perspective, it's
> actually easier to manage groups of 4 than groups of 2, since it's half
> as many configurations to keep track of.
>
> After splitting the cluster into 5 pieces it has become much more
> responsive and stable.  It's more difficult to manage than one large
> cluster, but the stability is obviously worth it.  We've been performing
> heavy load testing and have not been able to "break" the cluster.  We
> did a few more things to get to this point:
>
> - Lowered the nice value of the corosync process to make it more
> responsive under load and prevent a node from getting kicked out due to
> unresponsiveness.
> - Increased vm.min_free_kbytes to give TCP/IP w/ jumbo frames room to
> move around.  Without this, certain nodes would have low-memory issues
> related to networking and would get stonithed due to unresponsiveness.
>
> Thanks,
> Shawn
>
> -----Original Message-----
> From: Charles Taylor [mailto:taylor at hpc.ufl.edu]
> Sent: Wednesday, October 24, 2012 3:33 PM
> To: Hall, Shawn
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Large Corosync/Pacemaker clusters
>
>
> FWIW, we are running HA Lustre using corosync/pacemaker.    We broke our
> OSSs and MDSs out into individual HA *pairs*.   Thought about other
> configurations but it was our first step into corosync/pacemaker so we
> decided to keep it as simple as possible.   Seems to work well.    I'm
> not sure I would attempt what you are doing though it may be perfectly
> fine.   When HA is a requirement, it probably makes sense to avoid
> pushing the limits of what works.
>
> Doesn't really help you much other than to provide a data point with
> regard to what other sites are doing.
>
> Good luck and report back.
>
> Charlie Taylor
> UF HPC Center
>
> On Oct 19, 2012, at 12:52 PM, Hall, Shawn wrote:
>
>> Hi,
>>
>> We're setting up fairly large Lustre 2.1.2 filesystems, each with 18
> nodes and 159 resources all in one Corosync/Pacemaker cluster as
> suggested by our vendor.  We're getting mixed messages on how large of a
> Corosync/Pacemaker cluster will work well between our vendor and others.
>>
>> 1.       Are there Lustre Corosync/Pacemaker clusters out there of
> this size or larger?
>> 2.       If so, what tuning needed to be done to get it to work well?
>> 3.       Should we be looking more seriously into splitting this
> Corosync/Pacemaker cluster into pairs or sets of 4 nodes?
>>
>> Right now, our current configuration takes a long time to start/stop
> all resources (~30-45 mins), and failing back OSTs puts a heavy load on
> the cib process on every node in the cluster.  Under heavy IO load,
> many of the nodes will show as "unclean/offline" and many OST resources
> will show as inactive in crm status, despite the fact that every single
> MDT and OST is still mounted in the appropriate place.  We are running 2
> corosync rings, each on a private 1 GbE network.  We have a bonded 10
> GbE network for the LNET.
>>
>> Thanks,
>> Shawn
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss