[Lustre-discuss] best practice for lustre cluster startup

Robin Humble robin.humble+lustre at anu.edu.au
Thu Jul 1 11:21:32 PDT 2010


On Thu, Jul 01, 2010 at 11:17:31AM -0600, Kevin Van Maren wrote:
>My (personal) opinion:
>
>Lustre clients should always start (mount) automatically.

yup

>Lustre servers should have their services started through heartbeat (or 
>other HA package), if failover is possible (be sure to configure stonith).

IMHO that's a bad idea. servers should not start automatically.

my objections to automated mount/failover are not about Lustre itself,
but about all the layers underneath - as Kevin well knows, mptsas
drivers can, do and have screwed up majorly, and I'm sure other drivers
have too. md is far from smart, and disks break in so many weird and
wonderful ways that no driver or OS can reasonably be expected to deal
with them all :-/

if you have the simple setup of singly-attached storage and a Lustre
server just crashed, then why wouldn't it just crash again? we have had
that happen. automated startup seems silly in this case - especially if
you don't know what the problem was to start with. in the worst case the
hardware started corrupting data and then crashed the machine: is it
really a good idea to reboot, remount, corrupt more data, and then keep
rebooting until dawn?
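
a rough sketch of that reasoning as a boot-time guard, in python. this is
not something Lustre or heartbeat provides: the function name, the
one-hour window and the idea of recording boot timestamps and a
clean-shutdown flag are all assumptions made up for illustration.

```python
import time


def should_autostart(boot_times, clean_shutdown=True, now=None,
                     window_s=3600, max_boots=1):
    """Decide whether it looks safe to start services unattended.

    boot_times     -- hypothetical list of boot timestamps (seconds),
                      including the current boot
    clean_shutdown -- hypothetical flag: did the previous run shut down
                      cleanly, or did the node crash?
    window_s       -- how far back to look for a reboot loop
    max_boots      -- boots tolerated inside that window
    """
    if now is None:
        now = time.time()
    # an undiagnosed crash means a human should look before anything mounts
    if not clean_shutdown:
        return False
    # count boots inside the window; more than max_boots looks like a loop
    recent = [t for t in boot_times if 0 <= now - t <= window_s]
    return len(recent) <= max_boots
```

with these assumptions, a node that crashed, or that has already booted
twice in the last hour, stays down and waits for an admin instead of
remounting its storage.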

if you have a more elaborate Lustre setup with HA failover pairs then
the above still applies, and additionally there are inherent races in
both nodes in a pair trying to mount a set of disks if you do not have a
third impartial member participating in a failover quorum - not a
common HA setup for Lustre, although it probably should be.
if a sw raid is assembled on both machines at the same time because of
an HA race, then it's likely data will be lost. Lustre mmp should save
you from multi-mounting the OST, but obviously not from corruption if
the underlying raid is pre-trashed.
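
to make the quorum point concrete, here is a minimal python sketch of a
strict-majority check. it is a generic illustration, not the logic of
heartbeat or any other HA package, and has_quorum is a hypothetical name.

```python
def has_quorum(votes_seen, total_votes):
    """Return True if this side of a partition holds a strict majority.

    votes_seen  -- votes this node can currently reach, itself included
    total_votes -- votes configured in the cluster
    """
    if total_votes <= 0 or not 0 <= votes_seen <= total_votes:
        raise ValueError("invalid vote counts")
    # strict majority: at most one side of any partition can satisfy this
    return 2 * votes_seen > total_votes
```

in a two-node pair that loses its interconnect, each node sees 1 of 2
votes, so neither has a majority and voting alone cannot pick a winner.
with a third impartial voter, the side that can still reach it sees 2 of
3 and may take over, while the isolated node sees 1 of 3 and must leave
the disks alone.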

overall without diagnosing why a machine crashed I fail to see how an
automated reboot or failover can possibly be a safe course of action.

cheers,
robin

>If heartbeat starts automatically, do ensure auto-failback is NOT 
>enabled: fail the resources back manually after you verify the rebooted 
>server is healthy.
>Whether heartbeat starts automatically seems to be a preference issue.
>
>While unlikely, it is possible for an issue to cause Lustre to not start 
>successfully, resulting in a node crash or other issue preventing a 
>login.  So if it does start automatically you'll want to be prepared to 
>reboot w/o Lustre (eg, single-user mode).
>
>Kevin
>
>


