[Lustre-discuss] Lustre DRBD failover time

Cliff White Cliff.White at Sun.COM
Tue Jul 14 09:42:54 PDT 2009


tao.a.wu at nokia.com wrote:
>  
> Hi, all,
>  
> I am evaluating Lustre with DRBD failover, and experiencing about 2 
> minutes of OSS failover time to switch to the secondary node.  Has 
> anyone had a similar observation (so that we can conclude this should 
> be expected), or are there some parameters that I should tune to 
> reduce that time?
>  
> I have a simple setup: the MDS and OSS0 are hosted on server1, and OSS1 
> is hosted on server2.  OSS0 and OSS1 are the primary nodes for OST0 and 
> OST1, respectively, and the OSTs are replicated using DRBD (protocol C) 
> to the other machine.  The two OSTs are about 73GB each.  I am running 
> Lustre 1.6 + DRBD 8 + Heartbeat v2 (but using v1 configuration).
>  
>  From the HA logs, it looks like Heartbeat noticed a node was down within 10 
> seconds (which is consistent with the deadtime of 6 seconds).  Where does 
> the secondary node spend the remaining 100-110 seconds?  There was a 
> post 
> (_http://groups.google.com/group/lustre-discuss-list/msg/bbbeac047df678ca?dmode=source_) 
> attributing MDS failover time to fsck.  Could that also be the cause of my problem?

As Brian mentioned, Lustre servers go through a recovery process after
failover.  You need to examine the system logs on the OSS - if Lustre is
in recovery, there will be messages in the logs explaining this.
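For example, a quick sketch of how you might check this on the OSS
(assuming the Lustre 1.6 /proc layout, where each OST exposes a
recovery_status file under obdfilter - adjust paths for your version):

```shell
# Dump the recovery status of each OST on this OSS, if present.
# (Path assumes the Lustre 1.6 /proc layout.)
for f in /proc/fs/lustre/obdfilter/*/recovery_status; do
    [ -e "$f" ] && { echo "== $f =="; cat "$f"; }
done

# Recovery-related messages also land in the system log:
grep -i 'recovery' /var/log/messages 2>/dev/null | tail -20
```

While recovery is in progress, recovery_status shows how many clients
have reconnected and how much of the recovery window remains.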

cliffw



> Thanks,
>  
> -Tao
>  
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



