[Lustre-discuss] Question on setting up fail-over

Wojciech Turek wjt27 at cam.ac.uk
Tue Aug 10 15:09:39 PDT 2010


I would recommend a heartbeat with pacemaker setup for fail-over
control. The configuration may seem complex at the beginning, but after
enough reading (and there are many good sources) it is quite easy to set
up. I have recently set up a Lustre system with 3 OSSs and two MDSs (DRBD
with LVM between them) working as a single HA cluster and it was easy
enough. Pacemaker allows a single point of administration for the Lustre
system (starting and stopping the filesystem) and there is a neat GUI for
those who want to show something to their managers :)

Best regards,

Wojciech

On 10 August 2010 20:47, Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:

>
> On Tuesday, August 10, 2010, David Noriega wrote:
> > So your script resets the server so there is no fail-over (i.e. the
> > other server takes over resources from that server)? Or is there
> > failover, but you then manually return resources to the server that
> > was reset?
>
> Our ddn ipmi stonith script (external/ipmi_ddn in heartbeat/pacemaker
> stonith terms) only makes absolutely sure the node was really reset. If
> something fails, an error code is reported to pacemaker and then
> pacemaker (*) will not initiate resource fail-over in order to prevent
> split-brain.
> As Lustre devices use MMP (multiple-mount protection), that is not
> strictly required, in principle. But if something goes wrong, e.g. MMP
> was accidentally not enabled, a double mount could come up and that
> would cause serious filesystem and data corruption...
>
>
> Cheers,
> Bernd
>
> PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should*
> accept stonith error codes, but in general, I have seen more than once
> that heartbeat-v1 ran into split-brain and started resources on both
> cluster nodes.
> That is something where pacemaker does a much better job.
>
> --
> Bernd Schubert
> DataDirect Networks
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
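
The fencing rule Bernd describes in the quoted message above can be
sketched roughly as follows. This is only an illustration in Python, not
the external/ipmi_ddn script and not Pacemaker's own code; the names
decide_failover, FailoverDecision and fence_node are hypothetical. The
one assumption taken from the thread is that the stonith step reports an
error code unless the node was really reset, and that resources are only
moved when fencing succeeded.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverDecision:
    fenced: bool           # True only if the reset of the node was confirmed
    start_resources: bool  # True only if it is safe to take over resources
    reason: str


def decide_failover(fence_node: Callable[[str], int], node: str) -> FailoverDecision:
    """Fence `node` first, then decide whether fail-over may proceed.

    `fence_node` is a hypothetical stand-in for a stonith agent: it is
    assumed to return 0 only when the node was really reset, and a
    non-zero error code otherwise.
    """
    rc = fence_node(node)
    if rc == 0:
        # Node is known to be down, so a double mount cannot happen.
        return FailoverDecision(True, True, f"{node} confirmed reset")
    # Fencing failed: do not start resources elsewhere, to avoid split-brain.
    return FailoverDecision(False, False, f"fencing {node} failed with code {rc}")


if __name__ == "__main__":
    print(decide_failover(lambda n: 0, "oss1"))
    print(decide_failover(lambda n: 1, "oss1"))
```

The point of the sketch is the ordering: the take-over decision depends
only on the fencing result, so a failed or unconfirmed reset leaves the
resources where they are instead of risking a mount on both nodes.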



-- 
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517