[Lustre-discuss] Question on setting up fail-over

Tue Aug 10 12:47:06 PDT 2010

On Tuesday, August 10, 2010, David Noriega wrote:
> So your script resets the server so there is no fail-over (i.e. the other
> server takes over resources from that server)? Or there is failover,
> but you then manually return resources back to the server that was
> reset?

Our DDN IPMI stonith script (external/ipmi_ddn in heartbeat/pacemaker stonith 
terms) only makes absolutely sure that the node really was reset. If something 
fails, an error code is reported to pacemaker, and pacemaker (*) will then not 
initiate resource fail-over, in order to prevent split-brain. 
As Lustre devices use MMP (multiple-mount protection), that is not strictly 
required, in principle. But if something goes wrong, e.g. MMP was accidentally 
not enabled, a double mount could come up, and that would cause serious 
filesystem and data corruption... 
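
To illustrate the logic described above, here is a minimal Python sketch. It 
is not the external/ipmi_ddn script itself, and it is not how pacemaker is 
implemented; it only assumes that a fencing agent signals success with exit 
status 0, and that the feature list of a device is available as the 
"Filesystem features:" line printed by dumpe2fs -h. The function names 
(run_fence_agent, may_fail_over, mmp_enabled) are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def run_fence_agent(cmd: Sequence[str]) -> int:
    """Run a fencing command (hypothetical path/arguments) and return its exit status."""
    return subprocess.run(list(cmd), check=False).returncode


def may_fail_over(fence: Callable[[], int]) -> bool:
    """Allow resource takeover only if fencing reported success (exit status 0).

    Any non-zero status, or a failure to start the agent at all, is treated
    as "the peer may still be alive", so resources must stay where they are.
    """
    try:
        return fence() == 0
    except OSError:
        # The agent could not even be started: count this as a fencing failure.
        return False


def mmp_enabled(features_line: str) -> bool:
    """Check a 'Filesystem features:' line (as printed by dumpe2fs -h) for 'mmp'."""
    _, _, features = features_line.partition(":")
    return "mmp" in features.split()
```

A caller could combine the two helpers as 
may_fail_over(lambda: run_fence_agent(["/path/to/fence-script", "node2"])), 
where both the path and the node name are placeholders. The point of the 
sketch is only the ordering: fencing has to be confirmed first, and MMP stays 
a second safety net rather than a replacement for it.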

Cheers,
Bernd

PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should* accept 
stonith error codes, but in general, I have seen more than once that 
heartbeat-v1 ran into split-brain and started resources on both cluster nodes. 
That is something where pacemaker does a much better job.

-- 
Bernd Schubert
DataDirect Networks