[Lustre-discuss] Lustre Mount Crashing

Charles Taylor taylor at hpc.ufl.edu
Tue Jun 3 13:37:45 PDT 2008


I'm sorry, I should have updated you.  You are right, it was misleading.
The MDS/MDT was fine, and after about twenty minutes or so everything
became active.  We now have a working file system with data that we can
access, so we can't *thank you* enough.

BTW, that's a pretty obscure "fix".  I was going to ask for an
explanation, but we've been pretty busy doing fsck's and lfsck's (which
we are still working up to, since it takes a while to generate the
databases).  It is a pretty slow process, but things are looking
relatively good.  Of course, when you go from thinking you just lost
all your data to having almost all of it, anything looks pretty
good.  :)

Thanks again for your help,

Charlie Taylor
UF HPC Center

PS - we now refer to your commands to truncate the last_rcvd file as
the "Dilger Procedure" (with great reverence).  :)

ct

On Jun 3, 2008, at 4:20 PM, Andreas Dilger wrote:

> On Jun 02, 2008  19:51 -0400, Charles Taylor wrote:
>> Wow, you are one powerful witch doctor.  So we rebuilt our system disk
>> (just to be sure) and that made no difference; we still panicked as
>> soon as we mounted the MDT.  The "-o abort_recov" did not help either.
>> However, your recipe below worked wonders... almost.  Now we can mount
>> the MDT, but it does not go into recovery.  It just shows as
>> "inactive".  We are so close I can taste it, but what are we doing
>> wrong now?
>>
>>
>> [root@hpcmds lustre]# cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
>> status: INACTIVE
>>
>>
>> Which tire do we kick now?   :)
>
> Well, deleting the tail of the last_rcvd file is the "hard" way to tell
> the MDT/OST it is no longer in recovery...  The deleted part of the file
> is where the per-client state is kept, so when it is removed the MDT
> decides no recovery is needed.
>
> The "recovery_status" being "INACTIVE" is somewhat misleading.  It  
> means
> "no recovery is currently active", but the MDT is up and you should be
> able to use it, with the caveat that clients previously doing  
> operations
> will get an IO error for in-flight operations before they start  
> afresh...
> However, you said the clients are powered off, so they probably aren't
> busy doing anything...
>
> If you had a more complete stack trace it would be useful to determine
> what is actually going wrong with the mount.
>
>> On Jun 2, 2008, at 3:36 PM, Andreas Dilger wrote:
>>> If mounting with "-o abort_recov" doesn't solve the problem,
>>> are you able to mount the MDT filesystem as "-t ldiskfs" instead of
>>> "-t lustre"?  Try that, then copy and truncate the last_rcvd file:
>>>
>>> 	mount -t ldiskfs /dev/MDSDEV /mnt/mds
>>> 	cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
>>> 	cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
>>> 	dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
>>> 	umount /mnt/mds
>>>
>>> 	mount -t lustre /dev/MDSDEV /mnt/mds
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Sr. Staff Engineer, Lustre Group
>>> Sun Microsystems of Canada, Inc.
>>>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
