[Lustre-discuss] Question on setting up fail-over

Bernd Schubert bs_lists at aakef.fastmail.fm
Tue Aug 10 11:39:15 PDT 2010


On Tuesday, August 10, 2010, Kevin Van Maren wrote:
> Depends on the HA package you are using.  Heartbeat comes with a script
> that supports IPMI.
> 

For our installations we even use a modified external/ipmi_ddn stonith script 
that uses power-off/status/on to make sure the system is really reset. 
The heartbeat/pacemaker script uses the IPMI reset method by default, but the 
specs do not require IPMI commands to succeed. So ipmitool (used by 
external/ipmi) might return successfully, but that in no way ensures the node 
was really reset. I have seen that rather often in real life already.
The default script also supports the power-off/on method, but it does not 
check the status either. 

So our modified script first powers off, then checks that the node is really 
offline, then powers on again, and only then returns successfully. 
Unfortunately, that comes at the cost of an increased fail-over time, as 
power-off followed by power-on needs some minimal downtime in between (ca. 30s) 
and heartbeat/pacemaker stonith does not support async events (power-off alone 
would be sufficient, but once stonith returns successfully, it is not called 
again until the next fencing).
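
As a rough illustration of the off/status/on sequence described above, here is 
a minimal Python sketch. It is not our ipmi_ddn script. The function names, 
the lanplus interface, and the timeout and poll values are assumptions made 
for the example; only the "ipmitool ... chassis power off|status|on" commands 
and the ca. 30s pause come from real usage.

```python
import subprocess
import time


def ipmi_power(host, user, password, action):
    """Run `ipmitool ... chassis power <action>` and return its stdout.

    Assumes the BMC is reachable over the lanplus interface.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
           "chassis", "power", action]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def fence(power, timeout=60.0, poll=2.0, off_delay=30.0, sleep=time.sleep):
    """Power off, verify the node is really off, then power on again.

    `power` is any callable taking "off", "status" or "on" and returning the
    command output. Returns True only after the verified off/on cycle, and
    False if the node never reports "off" within `timeout` seconds.
    """
    power("off")
    waited = 0.0
    # Do not trust the exit code of "off": poll the status until it confirms.
    while "is off" not in power("status").lower():
        if waited >= timeout:
            return False          # fencing failed, node may still be running
        sleep(poll)
        waited += poll
    sleep(off_delay)              # minimal downtime before powering on again
    power("on")
    return True
```

It could be wired to a real BMC with something like 
fence(lambda a: ipmi_power("bmc-host", "admin", "secret", a)), where the host 
and credentials are placeholders. The caller blocks for the whole cycle, which 
is exactly the fail-over time cost mentioned above.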

-- 
Bernd Schubert
DataDirect Networks


