[Lustre-discuss] Question on setting up fail-over
Bernd Schubert
bs_lists at aakef.fastmail.fm
Tue Aug 10 11:39:15 PDT 2010
On Tuesday, August 10, 2010, Kevin Van Maren wrote:
> Depends on the HA package you are using. Heartbeat comes with a script
> that supports IPMI.
>
For our installations we even use a modified external/ipmi_ddn stonith script
that uses power-off/status/on to make sure the system is really reset.
The heartbeat/pacemaker script uses the IPMI reset method by default, but IPMI
commands are not required by the spec to succeed. So ipmitool (used by
external/ipmi) might return successfully, but that in no way ensures the node
was really reset. I have seen that rather often in real life already.
The default script also supports the power-off/on method, but it does not
check the power status either.
So our modified script first powers off, then checks if the node is really
offline, then powers on again and only then successfully returns.
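
The sequence above could be sketched roughly as follows. This is not the
ipmi_ddn script itself, only an illustration in Python. The names fence() and
ipmi(), the retry count, the poll delay and the lanplus interface are all
assumptions made for the example. It relies on "ipmitool chassis power status"
printing "Chassis Power is off" for a powered-down node.

```python
import subprocess
import time


def ipmi(host, user, password, *args):
    """Run one ipmitool command against a BMC and return its stdout.

    Hypothetical helper: the lanplus interface and password-on-command-line
    are assumptions for this sketch, not taken from the original script.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def fence(run, retries=10, delay=3.0, off_time=30.0, sleep=time.sleep):
    """Power off, verify the node is really off, then power on again.

    `run` is any callable taking ipmitool arguments and returning the
    command output. Returns True only if the off state was confirmed.
    """
    run("chassis", "power", "off")

    # Do not trust the exit code of "power off": poll the actual state.
    for _ in range(retries):
        if "is off" in run("chassis", "power", "status").lower():
            break
        sleep(delay)
    else:
        # Never saw the node go down, so report the fencing as failed
        # and do not power it back on.
        return False

    # Minimal downtime between off and on (ca. 30s, assumed default).
    sleep(off_time)
    run("chassis", "power", "on")
    return True
```

With a real BMC this might be called as
fence(lambda *a: ipmi("oss1-bmc", "admin", "secret", *a)), where the host name
and credentials are placeholders. Passing the command runner in as an argument
keeps the sequencing logic testable without hardware.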
Unfortunately, that comes at the cost of an increased fail-over time, as
power-off followed by power-on needs some minimal downtime in between (ca.
30s) and heartbeat/pacemaker stonith does not support async events (power-off
alone would be sufficient, but once stonith returns successfully, it is not
called again until the next fencing).
--
Bernd Schubert
DataDirect Networks