[Lustre-discuss] lustre 1.6.5.1 panic on failover

Andreas Dilger adilger at sun.com
Fri Aug 1 12:56:51 PDT 2008


On Aug 01, 2008  11:39 -0400, Brock Palen wrote:
> That works fine: the machine cycles, the second node takes over,
> and all is well.
> 
> If instead of crashing the node I run 'killall -9 heartbeat',
> I can get the panic every time.  I even changed the external/ipmi
> script from 'power reset' to 'power cycle', but that didn't help.
> 
> It's kind of unstable: if heartbeat dies, the whole MDS/MGS server setup
> would lock up, but if the server panics I will be OK.  I don't like this
> spot.
> 
> I am looking at grabbing a crash dump.  I think it's a race: heartbeat
> is mounting the filesystems before the first node is totally dead.
> 
> Does it hurt to run MMP on the MGS filesystem also?

It should be fine to run MMP on any ldiskfs filesystem even if not
in failover mode.  The only drawback is a minor delay (20s or something)
in mounting and e2fsck startup.
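
As a rough sketch only, enabling MMP on an unmounted ldiskfs target might
look like the following.  The device path is a placeholder (nothing in this
thread names one), and the script only prints the commands unless RUN is set:

```shell
#!/bin/sh
# Sketch only: DEV is a hypothetical device path, not one from this thread.
DEV="${DEV:-/dev/sdX}"

# Dry run by default: commands are printed, not executed.
# Set RUN=1 in the environment to actually run them.
run() {
    if [ -n "${RUN:-}" ]; then "$@"; else echo "$@"; fi
}

# Turn on the multiple-mount-protection feature.
# The filesystem must be unmounted on both nodes when this is done.
run tune2fs -O mmp "$DEV"

# Print the superblock header so the feature list can be checked for "mmp".
run dumpe2fs -h "$DEV"
```

The dry-run wrapper is there because running tune2fs against the wrong
device on shared storage is easy to do and hard to undo.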

> On Jul 31, 2008, at 5:28 PM, Klaus Steden wrote:
> >
> > Hi Brock,
> >
> > I've been using Sun X2200s with Lustre in a similar configuration (IPMI,
> > STONITH, Linux-HA, FC storage) and haven't had any issues like this
> > (although I would typically panic the primary node during testing using
> > SysRq) ... is the behaviour consistent?
> >
> > Klaus
> >
> > On 7/31/08 1:57 PM, "Brock Palen" <brockp at umich.edu> did etch on stone
> > tablets:
> >
> >> I have two machines I am setting up as my first MDS failover pair.
> >>
> >> The two Sun X4100s are connected to an FC disk array.  I have set up
> >> heartbeat with IPMI for STONITH.
> >>
> >> The problem is that when I run 'killall -9 heartbeat' as a test on the
> >> host that currently has the MDS/MGS mounted, I see the IPMI shutdown, and
> >> when the second X4100 tries to mount the filesystem it kernel panics.
> >>
> >> Has anyone else seen this behavior?  Is there something I am running
> >> into?  If I do an 'hb_takeover' or shut down heartbeat cleanly, all is
> >> well.  Only if I simulate heartbeat failing does this happen.  Note I
> >> have not tried yanking power yet, but I want to simulate an MDS in a
> >> semi-dead state and ran into this.
> >>
> >>
> >> Brock Palen
> >> www.umich.edu/~brockp
> >> Center for Advanced Computing
> >> brockp at umich.edu
> >> (734)936-1985
> >>
> >>
> >>
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
