[Lustre-discuss] Large Corosync/Pacemaker clusters

Wed Oct 24 13:58:07 PDT 2012

Shawn,

In my opinion, you shouldn't run Corosync across more than two machines
per cluster. Configure the servers as self-contained failover pairs (an
MDS pair, OSS pairs). Anything beyond that would be chaos to manage, even
if it worked. Don't forget the STONITH portion: not every block storage
implementation respects MMP (multiple-mount protection), so you can't
rely on MMP alone to keep a target from being mounted on both nodes of a
pair.
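
As a rough sketch of what that layout looks like for an 18-node system,
here is how the nodes would break down into independent two-node
clusters. The hostnames and the 2 MDS + 16 OSS split are assumptions for
illustration only, not taken from your site:

```python
from typing import Dict, List, Tuple


def split_into_pairs(nodes: List[str]) -> List[Tuple[str, str]]:
    """Group an ordered node list into two-node failover pairs."""
    if len(nodes) % 2 != 0:
        raise ValueError("need an even number of nodes to form pairs")
    return [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]


# Hypothetical hostnames: 2 MDS nodes + 16 OSS nodes = 18 nodes total.
mds_nodes = ["mds01", "mds02"]
oss_nodes = [f"oss{i:02d}" for i in range(1, 17)]

# Each entry is meant to be its own Corosync/Pacemaker cluster with its
# own STONITH configuration; no cluster spans more than two nodes.
clusters: Dict[str, Tuple[str, str]] = {}
for idx, pair in enumerate(split_into_pairs(mds_nodes), start=1):
    clusters[f"mds-pair{idx}"] = pair
for idx, pair in enumerate(split_into_pairs(oss_nodes), start=1):
    clusters[f"oss-pair{idx}"] = pair

for name, (node_a, node_b) in clusters.items():
    print(f"{name}: {node_a} <-> {node_b}")
```

That gives nine small clusters instead of one large one, so each cluster
only has to track the resources of its own pair.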

--Jeff

On 10/19/12 9:52 AM, Hall, Shawn wrote:
>
> Hi,
>
> We’re setting up fairly large Lustre 2.1.2 filesystems, each with 18 
> nodes and 159 resources all in one Corosync/Pacemaker cluster as 
> suggested by our vendor. We’re getting mixed messages from our vendor and 
> others on how large a Corosync/Pacemaker cluster will work well.
>
> 1. Are there Lustre Corosync/Pacemaker clusters out there of this size 
> or larger?
>
> 2. If so, what tuning needed to be done to get it to work well?
>
> 3. Should we be looking more seriously into splitting this 
> Corosync/Pacemaker cluster into pairs or sets of 4 nodes?
>
> Right now, our current configuration takes a long time to start/stop 
> all resources (~30-45 mins), and failing back OSTs puts a heavy load 
> on the cib process on every node in the cluster. Under heavy IO load, 
> many of the nodes will show as “unclean/offline” and many OST 
> resources will show as inactive in crm status, despite the fact that 
> every single MDT and OST is still mounted in the appropriate place. We 
> are running 2 corosync rings, each on a private 1 GbE network. We have 
> a bonded 10 GbE network for the LNET.
>
> Thanks,
>
> Shawn

-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
m: 619-204-9061

/* New Address */
4170 Morena Boulevard, Suite D - San Diego, CA 92117