[Lustre-discuss] Lustre Mount Crashing
Andreas Dilger
adilger at sun.com
Tue Jun 3 13:55:35 PDT 2008
On Jun 03, 2008 16:37 -0400, Charles Taylor wrote:
> I'm sorry, I should have updated you. You are right, it was
> misleading. The MDS/MDT was fine and after about twenty minutes or
> so everything became active and we now have a working file system with
> data that we can access, so we can't *thank you* enough.
You're welcome.
> BTW, that's a pretty obscure "fix". I was going to ask for an
> explanation, but we've been pretty busy doing fsck's and lfsck's (which
> we are still working up to since it takes a while to generate the
> db's). It is a pretty slow process but things are looking
> relatively good. Of course, when you go from thinking you just lost
> all your data to having almost all of it, anything looks pretty
> good. :)
>
> PS - we now refer to your commands to truncate the last_rcvd file as
> the "Dilger Procedure" (with great reverence). :)
Well, by no means should this be a normal process. If you can spare the
time after your system is back in shape, copying the last_rcvd.sav file
onto a test MDT and mounting it with a serial console enabled would help
track down the root cause of this. The fewer people who have to perform
the "Dilger Procedure" the better.
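
If it helps, that test could look roughly like this (the device and mount
point names are only placeholders for whatever scratch MDT you set up, and
I'm assuming a serial console or netconsole is already capturing kernel
output):

    # on a test MDS node, with a copy of the MDT on /dev/TESTMDSDEV
    mount -t ldiskfs /dev/TESTMDSDEV /mnt/test-mds
    # put back the original, untruncated last_rcvd saved earlier
    cp /tmp/last_rcvd.sav /mnt/test-mds/last_rcvd
    umount /mnt/test-mds

    # with the console logging, this mount should reproduce the crash
    mount -t lustre /dev/TESTMDSDEV /mnt/test-mds

A full stack trace captured that way is what we'd need to find the real bug.
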
> On Jun 3, 2008, at 4:20 PM, Andreas Dilger wrote:
> > On Jun 02, 2008 19:51 -0400, Charles Taylor wrote:
> >> Wow, you are one powerful witch doctor. So we rebuilt our system disk
> >> (just to be sure) and that made no difference; we still panicked as
> >> soon as we mounted the MDT. The "-o abort_recov" did not help either.
> >> However, your recipe below worked wonders... almost. Now we can mount
> >> the MDT but it does not go into recovery. It just shows as "inactive".
> >> We are so close I can taste it, but what are we doing wrong now?
> >>
> >>
> >> [root@hpcmds lustre]# cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
> >> status: INACTIVE
> >>
> >>
> >> Which tire do we kick now? :)
> >
> > Well, deleting the tail of the last_rcvd file is the "hard" way to tell
> > the MDT/OST it is no longer in recovery... The deleted part of the file
> > is where the per-client state is kept, so when it is removed the MDT
> > decides no recovery is needed.
> >
> > The "recovery_status" being "INACTIVE" is somewhat misleading. It
> > means
> > "no recovery is currently active", but the MDT is up and you should be
> > able to use it, with the caveat that clients previously doing
> > operations
> > will get an IO error for in-flight operations before they start
> > afresh...
> > However, you said the clients are powered off, so they probably aren't
> > busy doing anything...
> >
> > If you had a more complete stack trace it would be useful to determine
> > what is actually going wrong with the mount.
> >
> >> On Jun 2, 2008, at 3:36 PM, Andreas Dilger wrote:
> >>> If mounting with "-o abort_recov" doesn't solve the problem,
> >>> are you able to mount the MDT filesystem as "-t ldiskfs" instead of
> >>> "-t lustre"? Try that, then copy and truncate the last_rcvd file:
> >>>
> >>> mount -t ldiskfs /dev/MDSDEV /mnt/mds
> >>> cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
> >>> cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
> >>> dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
> >>> umount /mnt/mds
> >>>
> >>> mount -t lustre /dev/MDSDEV /mnt/mds
> >>>
> >>> Cheers, Andreas
> >>> --
> >>> Andreas Dilger
> >>> Sr. Staff Engineer, Lustre Group
> >>> Sun Microsystems of Canada, Inc.
> >>>
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
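
PS: to summarize why the dd step in the recipe quoted above works, as far
as I recall the first 8kB of last_rcvd holds only the server's own state,
and the per-client recovery records start after that, so keeping just the
first 8kB drops the client state without touching the server data. A rough
annotated version (as in the recipe above, /dev/MDSDEV is a placeholder
for your actual MDT device):

    # mount the MDT backing filesystem directly, without starting Lustre
    mount -t ldiskfs /dev/MDSDEV /mnt/mds
    # keep both an on-disk and an off-disk copy of the original file
    cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
    cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
    # keep only the first 8kB (server data), discarding the per-client slots
    dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
    umount /mnt/mds
    # remount as Lustre; with no client records left, no recovery is attempted
    mount -t lustre /dev/MDSDEV /mnt/mds

Again, please treat this as a last resort rather than a standard procedure.
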
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.