[Lustre-discuss] failover software - heartbeat

Jim Garlick garlick at llnl.gov
Mon Jul 13 13:39:08 PDT 2009


No.  I originally did have it set up like this (a v1 ha.cf snippet):

# One partner losing contact with both lnet routers or MDS triggers failover.
#ping_group lnet-router 172.16.10.254 172.16.2.254
#ping_group tycho-mds1 172.16.10.200 172.16.2.200
#respawn hacluster /usr/lib64/heartbeat/ipfail

However, I ran into a problem when rebooting the MDS.  Apparently if one
partner re-establishes contact with the MDS before the other one, it
immediately triggers failover.  This is with heartbeat-2.1.4.

Jim
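
The race described above can be pictured with a toy model. This is only a sketch under stated assumptions, not how heartbeat or ipfail is implemented: it assumes each partner reports how many ping groups it can currently reach, and that failover fires as soon as the peer reaches more groups than the local node. The names `Sample`, `naive_failover_time`, `debounced_failover_time`, and `hold_down` are hypothetical and exist only for illustration. The second function shows one possible mitigation: require the asymmetry to persist for a hold-down period before acting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    t: int      # seconds since observation started
    local: int  # ping groups reachable from this node (hypothetical metric)
    peer: int   # ping groups reachable from the partner node


def naive_failover_time(samples: List[Sample]) -> Optional[int]:
    """Time of the first sample where the peer reaches more ping groups than we do."""
    for s in samples:
        if s.peer > s.local:
            return s.t
    return None


def debounced_failover_time(samples: List[Sample], hold_down: int) -> Optional[int]:
    """Time failover would fire if the asymmetry must last at least hold_down seconds."""
    since = None  # start time of the current run of asymmetric samples
    for s in samples:
        if s.peer > s.local:
            if since is None:
                since = s.t
            if s.t - since >= hold_down:
                return s.t
        else:
            since = None  # connectivity is symmetric again, reset the timer
    return None


# MDS reboot: both partners lose the MDS ping group at t=1,
# the peer regains it at t=3, the local node regains it at t=5.
mds_reboot = [
    Sample(0, 2, 2),
    Sample(1, 1, 1),
    Sample(2, 1, 1),
    Sample(3, 1, 2),
    Sample(4, 1, 2),
    Sample(5, 2, 2),
]

print(naive_failover_time(mds_reboot))                    # 3
print(debounced_failover_time(mds_reboot, hold_down=10))  # None
```

In this model the naive rule fails over at t=3 even though nothing is wrong with the local node, while the hold-down variant rides out the two-second window. A genuine, lasting loss of connectivity on one partner would still trigger failover once the hold-down expires.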

On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
> Were you able to get monitoring working to detect network failures?  (pingd?)
> 
> I have it configured, but haven't been able to get it to trigger a failover when an MDS cannot ping the network.  (I tried with 1.0 and 2.0 conf files; I am currently using 2.0.)  I have a ticket open with the pacemaker project (no ticket system for the HA stuff...)
> but no resolution.  I am considering writing a script to down the node when the ping fails, but don't like the idea.
> 
> I would also like to get the hpingd functioning to detect a fiber failure, but there was less available on that solution.
> 
> --
> Andrew
> 
> > -----Original Message-----
> > From: Jim Garlick [mailto:garlick at llnl.gov]
> > Sent: Monday, July 13, 2009 2:21 PM
> > To: Lundgren, Andrew
> > Cc: Carlos Santana; lustre-discuss at lists.lustre.org
> > Subject: Re: [Lustre-discuss] failover software - heartbeat
> > 
> > We recently put heartbeat v1 in production and along the way
> > developed some admin scripts including heartbeat resource agent
> > compliant lustre init scripts, a script to initiate failover/failback
> > and get detailed status, a powerman stonith interface, and various
> > safeguards to ensure MMP is on, devices are present and usable, etc.
> > before starting lustre.
> > 
> > If this is of general interest I could post it to a bug for review.
> > 
> > Jim
> > 
> > On Mon, Jul 13, 2009 at 01:45:02PM -0600, Lundgren, Andrew wrote:
> > > It is very difficult to find relevant documentation for heartbeat
> > > 1/2. I just finished configuring a heartbeat system and would not
> > > recommend it because of the documentation.  (They seem to have removed
> > > portions of the heartbeat documentation from the site.)
> > >
> > > Pacemaker is not a simple solution to configure either. I played
> > > briefly with the RH clustering software.  It does not directly support
> > > any FS type other than the basic ext2/ext3, and wasn't happy with a
> > > lustre type.
> > >
> > > --
> > > Andrew
> > >
> > > > -----Original Message-----
> > > > From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Carlos Santana
> > > > Sent: Monday, July 13, 2009 11:42 AM
> > > > To: lustre-discuss at lists.lustre.org
> > > > Subject: [Lustre-discuss] failover software - heartbeat
> > > >
> > > > Howdy,
> > > >
> > > > The lustre manual recommends heartbeat for handling failover.
> > > > Pacemaker is the successor of heartbeat version 2. So what's
> > > > recommended - should we be using pacemaker or stick to heartbeat?
> > > >
> > > > -
> > > > CS.
> > > > _______________________________________________
> > > > Lustre-discuss mailing list
> > > > Lustre-discuss at lists.lustre.org
> > > > http://lists.lustre.org/mailman/listinfo/lustre-discuss


