[Lustre-discuss] Large Corosync/Pacemaker clusters

Fri Oct 19 09:52:14 PDT 2012

Hi,

We're setting up fairly large Lustre 2.1.2 filesystems, each with 18
nodes and 159 resources all in one Corosync/Pacemaker cluster as
suggested by our vendor.  We're getting mixed messages from our vendor and
others on how large a Corosync/Pacemaker cluster can be and still work well.

1.       Are there Lustre Corosync/Pacemaker clusters out there of this
size or larger?

2.       If so, what tuning needed to be done to get it to work well?

3.       Should we be looking more seriously into splitting this
Corosync/Pacemaker cluster into pairs or sets of 4 nodes?

Right now, our configuration takes a long time to start/stop all
resources (~30-45 mins), and failing back OSTs puts a heavy load on the
cib process on every node in the cluster.  Under heavy IO load, many of
the nodes will show as "unclean/offline" and many OST resources will
show as inactive in crm status, despite the fact that every single MDT
and OST is still mounted in the appropriate place.  We are running 2
corosync rings, each on a private 1 GbE network.  We have a bonded 10
GbE network for the LNET.

Thanks,

Shawn
