[Lustre-discuss] Lustre DRBD failover time

Tue Jul 14 09:10:29 PDT 2009

On Tue, 2009-07-14 at 17:54 +0200, tao.a.wu at nokia.com wrote:
>  
> Hi, all,
>  
> I am evaluating Lustre with DRBD failover, and experiencing about 2
> minutes in OSS failover time to switch to the secondary node.

What do those 2 minutes include?  Just the time for the second OSS to
mount the disk and start recovery, or is it 2 minutes to detect the
primary failure and have the secondary complete recovery so that the
clients are fully functional again?

If the latter, then you are doing quite well.  Recovery is not an
instantaneous process.  Much work needs to be done to ensure coherency
between what is on the disk of the failed-over OST and what the clients
think is on disk.  Getting to this state requires that all clients
synchronize with the OST.  Waiting for many clients to do this can,
currently, take many minutes, as each client first has to notice that
the primary is dead and then sync up with the failover.  Some clients
might not even be available to sync, in which case you have to wait for
a timeout.
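
To make that arithmetic concrete, here is a rough sketch in Python.  It
is only a toy model of the reasoning above, not Lustre's actual recovery
algorithm: the names (Client, recovery_window_s, total_failover_s) and
every number you feed it are hypothetical, and the single
recovery_timeout_s value stands in for whatever timeout your setup
really uses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Client:
    """Hypothetical client record for this toy model."""
    name: str
    # Seconds after the failover OST is mounted until this client has
    # noticed the failure and reconnected; None if it never shows up.
    reconnect_after_s: Optional[float]


def recovery_window_s(clients: Sequence[Client],
                      recovery_timeout_s: float) -> float:
    """Seconds from failover mount until recovery can finish.

    Recovery ends when the slowest client has reconnected, or when the
    (assumed) recovery timeout expires because some client is missing
    or too slow.
    """
    slowest = 0.0
    for c in clients:
        if c.reconnect_after_s is None or c.reconnect_after_s > recovery_timeout_s:
            # One absent or late client forces a wait for the full timeout.
            return recovery_timeout_s
        slowest = max(slowest, c.reconnect_after_s)
    return slowest


def total_failover_s(detect_s: float, mount_s: float,
                     clients: Sequence[Client],
                     recovery_timeout_s: float) -> float:
    """Failure detection + mount on the secondary + recovery window."""
    return detect_s + mount_s + recovery_window_s(clients, recovery_timeout_s)
```

The point of the model is that the recovery window is set by the slowest
client, and a single client that never reconnects pushes the whole thing
out to the timeout, regardless of how quickly everyone else synced.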

So if you are talking 2 minutes from failure to full recovery, you are
not likely going to put much of a dent in this.

Lustre 1.8 has adaptive timeouts enabled by default, and that should
help in optimal situations, but a full recovery will still take time.

b.
