[Lustre-discuss] Lustre DRBD failover time

tao.a.wu at nokia.com
Tue Jul 14 08:54:38 PDT 2009


Hi, all,

I am evaluating Lustre with DRBD failover, and I am seeing an OSS failover time of about 2 minutes when switching to the secondary node.  Has anyone had a similar observation (so that we can conclude this is to be expected), or are there parameters I should tune to reduce that time?

I have a simple setup: the MDS and OSS0 are hosted on server1, and OSS1 is hosted on server2.  OSS0 and OSS1 are the primary nodes for OST0 and OST1, respectively, and the OSTs are replicated to the other machine using DRBD (protocol C).  The two OSTs are about 73GB each.  I am running Lustre 1.6 + DRBD 8 + Heartbeat v2 (but using the v1 configuration style).
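In case it is useful, here is a rough sketch of the kind of configuration I am using (hostnames, devices, and addresses below are illustrative placeholders, not my exact values):

    # /etc/drbd.conf -- one resource per OST, replicated with protocol C
    resource ost0 {
      protocol C;                      # synchronous replication
      on server1 {
        device    /dev/drbd0;          # DRBD device the OST is formatted on
        disk      /dev/sdb1;           # backing disk (placeholder)
        address   192.168.1.1:7788;    # replication link (placeholder)
        meta-disk internal;
      }
      on server2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }

    # /etc/ha.d/haresources -- Heartbeat v1 resource line for OST0
    server1 drbddisk::ost0 Filesystem::/dev/drbd0::/mnt/ost0::lustre

On failover, Heartbeat promotes the DRBD resource to primary on the surviving node and then mounts the OST there.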

From the HA logs, it looks like Heartbeat noticed the node was down within 10 seconds (which is consistent with the deadtime of 6 seconds).  Where does the secondary node spend the remaining 100-110 seconds?  There was a post (http://groups.google.com/group/lustre-discuss-list/msg/bbbeac047df678ca?dmode=source) attributing MDS failover time to fsck.  Could that also be the cause of my problem?
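
For reference, the timing-related part of my ha.cf is roughly the following (the deadtime of 6 seconds matches what I quoted above; the other values are approximate):

    # /etc/ha.d/ha.cf -- timing parameters (illustrative except deadtime)
    keepalive 2        # heartbeat interval in seconds
    deadtime 6         # declare a node dead after 6 seconds of silence
    warntime 4         # log a warning before declaring the node dead
    initdead 30        # allow extra time at initial cluster startup
    node server1
    node server2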

Thanks,

-Tao



