[Lustre-discuss] failover software - heartbeat

Jim Garlick garlick at llnl.gov
Mon Jul 13 14:05:08 PDT 2009


On network failures: no.

On fibre path failures: we configure ldiskfs with errors=panic so fibre
issues or other issues in the storage path will likely cause a panic and
trigger failover.
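
As a rough illustration of the kind of pre-start safeguard this implies, here
is a minimal sketch that checks whether a comma-separated mount option string
asks for a panic on filesystem errors. The function names and sample option
strings are hypothetical and are not taken from our scripts. Keeping the last
errors= value when the option appears more than once is a choice made for this
sketch only.

```python
def errors_behavior(mount_options):
    """Return the value of the errors= option in a comma-separated
    mount option string, or None if the option is not present."""
    behavior = None
    for opt in mount_options.split(","):
        key, sep, value = opt.strip().partition("=")
        if key == "errors" and sep:
            # Sketch assumption: keep the last errors= value seen.
            behavior = value
    return behavior


def panics_on_error(mount_options):
    """True only when the option string explicitly requests errors=panic."""
    return errors_behavior(mount_options) == "panic"


if __name__ == "__main__":
    # Hypothetical option strings, for illustration only.
    for opts in ("errors=panic,user_xattr", "errors=remount-ro,user_xattr"):
        print(opts, "->", panics_on_error(opts))
```

A start script could refuse to bring up a target when panics_on_error()
returns False, since without the panic a storage path error would not
trigger failover in the setup described above.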

We're just getting started with failover so we elected to keep it simple
for now.

Jim

On Mon, Jul 13, 2009 at 02:41:09PM -0600, Lundgren, Andrew wrote:
> Are you doing anything if the network fails to one mds?
> 
> How about if your fiber path fails?
> 
> > -----Original Message-----
> > From: Jim Garlick [mailto:garlick at llnl.gov]
> > Sent: Monday, July 13, 2009 2:39 PM
> > To: Lundgren, Andrew
> > Cc: Carlos Santana; lustre-discuss at lists.lustre.org
> > Subject: Re: [Lustre-discuss] failover software - heartbeat
> > 
> > No.  I originally did have it set up like this (a v1 ha.cf snippet):
> > 
> > # One partner losing contact with both lnet routers or MDS triggers failover.
> > #ping_group lnet-router 172.16.10.254 172.16.2.254
> > #ping_group tycho-mds1 172.16.10.200 172.16.2.200
> > #respawn hacluster /usr/lib64/heartbeat/ipfail
> > 
> > However, I ran into a problem when rebooting the MDS.  Apparently if one
> > partner re-establishes contact with the MDS before the other one, it
> > immediately triggers failover.  This is with heartbeat-2.1.4.
> > 
> > Jim
> > 
> > On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
> > > Were you able to get monitoring working to detect network failures? (pingd?)
> > >
> > > I have it configured, but haven't been able to get it to trigger a
> > > failover when an MDS cannot ping the network.  (I tried with 1.0 and
> > > 2.0 conf files; I am currently using 2.0.)  I have a ticket open with
> > > the pacemaker project (there is no ticket system for the HA stuff...)
> > > but no resolution.  I am considering writing a script to down the
> > > node when the ping fails, but don't like the idea.
> > >
> > > I would also like to get hpingd working to detect a fiber failure,
> > > but there was less information available on that solution.
> > >
> > > --
> > > Andrew
> > >
> > > > -----Original Message-----
> > > > From: Jim Garlick [mailto:garlick at llnl.gov]
> > > > Sent: Monday, July 13, 2009 2:21 PM
> > > > To: Lundgren, Andrew
> > > > Cc: Carlos Santana; lustre-discuss at lists.lustre.org
> > > > Subject: Re: [Lustre-discuss] failover software - heartbeat
> > > >
> > > > We recently put heartbeat v1 in production and along the way
> > > > developed some admin scripts including heartbeat resource agent
> > > > compliant lustre init scripts, a script to initiate failover/failback
> > > > and get detailed status, a powerman stonith interface, and various
> > > > safeguards to ensure MMP is on, devices are present and usable, etc.
> > > > before starting lustre.
> > > >
> > > > If this is of general interest I could post it to a bug for review.
> > > >
> > > > Jim
> > > >
> > > > On Mon, Jul 13, 2009 at 01:45:02PM -0600, Lundgren, Andrew wrote:
> > > > > It is very difficult to find relevant documentation for heartbeat
> > > > > 1/2. I just finished configuring a heartbeat system and would not
> > > > > recommend it because of the documentation.  (They seem to have
> > > > > removed portions of the heartbeat documentation from the site.)
> > > > >
> > > > > Pacemaker is not a simple solution to configure either. I played
> > > > > briefly with the RH clustering software.  It does not directly
> > > > > support any FS type other than the basic ext2/ext3, and wasn't happy
> > > > > with a lustre type.
> > > > >
> > > > > --
> > > > > Andrew
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: lustre-discuss-bounces at lists.lustre.org
> > > > > > [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Carlos Santana
> > > > > > Sent: Monday, July 13, 2009 11:42 AM
> > > > > > To: lustre-discuss at lists.lustre.org
> > > > > > Subject: [Lustre-discuss] failover software - heartbeat
> > > > > >
> > > > > > Howdy,
> > > > > >
> > > > > > The lustre manual recommends heartbeat for handling failover.
> > > > > > Pacemaker is the successor of heartbeat version 2. So what's
> > > > > > recommended: should we be using pacemaker or stick to heartbeat?
> > > > > >
> > > > > > -
> > > > > > CS.
> > > > > > _______________________________________________
> > > > > > Lustre-discuss mailing list
> > > > > > Lustre-discuss at lists.lustre.org
> > > > > > http://***lists.lustre.org/mailman/listinfo/lustre-discuss