[Lustre-discuss] Lustre Mount Crashing
Andreas Dilger
adilger at sun.com
Tue Jun 3 13:55:35 PDT 2008
On Jun 03, 2008 16:37 -0400, Charles Taylor wrote:
> I'm sorry, I should have updated you. You are right, it was
> misleading. The MDS/MDT was fine and after about twenty minutes or
> so everything became active and we now have a working file system with
> data that we can access, so we can't *thank you* enough.
You're welcome.
> BTW, that's a pretty obscure "fix". I was going to ask for an
> explanation, but we've been pretty busy doing fsck's and lfsck's (which
> we are still working up to since it takes a while to generate the
> db's). It is a pretty slow process but things are looking
> relatively good. Of course, when you go from thinking you just lost
> all your data to having almost all of it, anything looks pretty
> good. :)
>
> PS - we now refer to your commands to truncate the last_rcvd file as
> the "Dilger Procedure" (with great reverence). :)
Well, by no means should this be a normal process. If you can spare the
time after your system is back in shape, copying the last_rcvd.sav file
onto a test MDT and mounting it with a serial console enabled would help
track down the root cause of this. The fewer people who have to perform
the "Dilger Procedure" the better.
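
If it helps, that test could look roughly like this (the device and mount
point names are only placeholders for whatever scratch MDT you set up, and
I'm assuming a serial console or netconsole is already capturing kernel
output):

    # on a test MDS node, with a copy of the MDT on /dev/TESTMDSDEV
    mount -t ldiskfs /dev/TESTMDSDEV /mnt/test-mds
    # put back the original, untruncated last_rcvd saved earlier
    cp /tmp/last_rcvd.sav /mnt/test-mds/last_rcvd
    umount /mnt/test-mds

    # with the console logging, this mount should reproduce the crash
    mount -t lustre /dev/TESTMDSDEV /mnt/test-mds

A full stack trace captured that way is what we'd need to find the real bug.
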
> On Jun 3, 2008, at 4:20 PM, Andreas Dilger wrote:
> > On Jun 02, 2008 19:51 -0400, Charles Taylor wrote:
> >> Wow, you are one powerful witch doctor. So we rebuilt our system disk
> >> (just to be sure) and that made no difference; we still panicked as
> >> soon as we mounted the MDT. The "-o abort_recov" did not help either.
> >> However, your recipe below worked wonders... almost. Now we can mount
> >> the MDT but it does not go into recovery. It just shows as "inactive".
> >> We are so close I can taste it, but what are we doing wrong now?
> >>
> >>
> >> [root@hpcmds lustre]# cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
> >> status: INACTIVE
> >>
> >>
> >> Which tire do we kick now? :)
> >
> > Well, deleting the tail of the last_rcvd file is the "hard" way to tell
> > the MDT/OST it is no longer in recovery... The deleted part of the file
> > is where the per-client state is kept, so when it is removed the MDT
> > decides no recovery is needed.
> >
> > The "recovery_status" being "INACTIVE" is somewhat misleading. It
> > means
> > "no recovery is currently active", but the MDT is up and you should be
> > able to use it, with the caveat that clients previously doing
> > operations
> > will get an IO error for in-flight operations before they start
> > afresh...
> > However, you said the clients are powered off, so they probably aren't
> > busy doing anything...
> >
> > If you had a more complete stack trace it would be useful to determine
> > what is actually going wrong with the mount.
> >
> >> On Jun 2, 2008, at 3:36 PM, Andreas Dilger wrote:
> >>> If mounting with "-o abort_recov" doesn't solve the problem,
> >>> are you able to mount the MDT filesystem as "-t ldiskfs" instead of
> >>> "-t lustre"? Try that, then copy and truncate the last_rcvd file:
> >>>
> >>> mount -t ldiskfs /dev/MDSDEV /mnt/mds
> >>> cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
> >>> cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
> >>> dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
> >>> umount /mnt/mds
> >>>
> >>> mount -t lustre /dev/MDSDEV /mnt/mds
> >>>
> >>> Cheers, Andreas
> >>> --
> >>> Andreas Dilger
> >>> Sr. Staff Engineer, Lustre Group
> >>> Sun Microsystems of Canada, Inc.
> >>>
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
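
PS: to summarize why the dd step in the recipe quoted above works, as far
as I recall the first 8kB of last_rcvd holds only the server's own state,
and the per-client recovery records start after that, so keeping just the
first 8kB drops the client state without touching the server data. A rough
annotated version (as in the recipe above, /dev/MDSDEV is a placeholder
for your actual MDT device):

    # mount the MDT backing filesystem directly, without starting Lustre
    mount -t ldiskfs /dev/MDSDEV /mnt/mds
    # keep both an on-disk and an off-disk copy of the original file
    cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
    cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
    # keep only the first 8kB (server data), discarding the per-client slots
    dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
    umount /mnt/mds
    # remount as Lustre; with no client records left, no recovery is attempted
    mount -t lustre /dev/MDSDEV /mnt/mds

Again, please treat this as a last resort rather than a standard procedure.
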
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.