[Lustre-discuss] Question on setting up fail-over
Bernd Schubert
bs_lists at aakef.fastmail.fm
Tue Aug 10 12:47:06 PDT 2010
On Tuesday, August 10, 2010, David Noriega wrote:
> So your script resets the server so there is no fail-over (i.e. the other
> server takes over resources from that server), or there is fail-over
> but you then manually return resources back to the server that was
> reset?
Our DDN IPMI stonith script (external/ipmi_ddn in heartbeat/pacemaker stonith
terms) only makes absolutely sure the node was really reset. If anything
fails, it reports an error code to pacemaker, and pacemaker (*) then will not
initiate resource fail-over, in order to prevent split-brain.
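
To illustrate that contract (this is not the actual external/ipmi_ddn script,
only a minimal sketch in Python): request the reset, confirm it independently,
and return non-zero unless the confirmation succeeds. The names `fence`,
`reset_node` and `node_was_reset` are hypothetical; the two callables stand in
for whatever IPMI commands a real plugin would run.

```python
from typing import Callable


def fence(reset_node: Callable[[], bool],
          node_was_reset: Callable[[], bool],
          attempts: int = 3) -> int:
    """Return 0 only if the reset is positively confirmed, otherwise 1.

    reset_node and node_was_reset are hypothetical placeholders for the
    real power-control and verification steps.
    """
    for _ in range(attempts):
        try:
            # Success requires BOTH the reset request and the independent check.
            if reset_node() and node_was_reset():
                return 0
        except Exception:
            # Any error means "not confirmed"; retry, never assume success.
            continue
    # Non-zero tells the cluster manager that fencing failed,
    # so it should not start resources on the other node.
    return 1
```

The point of the design is that every doubtful outcome (timeout, exception,
unconfirmed state) maps to failure, so fail-over is only attempted after a
verified reset.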
As Lustre devices use MMP (multiple-mount protection), that is not strictly
required, in principle. But if something goes wrong, e.g. MMP was accidentally
not enabled, a double mount could come up, and that would cause serious
filesystem and data corruption...
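
One way to guard against the "MMP accidentally not enabled" case is to check
the feature list of each ldiskfs device. The following is a small sketch in
Python, assuming its input is the header text printed by `dumpe2fs -h <device>`
and that the header contains a "Filesystem features:" line; the helper name
`has_mmp` is made up for this example.

```python
def has_mmp(dumpe2fs_header: str) -> bool:
    """Return True if 'mmp' is listed on the 'Filesystem features:' line.

    dumpe2fs_header is assumed to be the text output of `dumpe2fs -h <device>`.
    """
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            # Features are whitespace-separated tokens after the colon.
            return "mmp" in value.split()
    # No features line found: treat as "not enabled" to stay on the safe side.
    return False
```

Such a check could be run on every target before the cluster is put into
production; a missing feature would then be noticed before a double mount can
happen.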
Cheers,
Bernd
PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should* accept
stonith error codes, but in general, I have seen more than once that
heartbeat-v1 ran into split-brain and started resources on both cluster nodes.
That is something where pacemaker does a much better job.
--
Bernd Schubert
DataDirect Networks