[lustre-discuss] recovery MDT ".." directory entries (LU-5626)

Patrick Farrell paf at cray.com
Tue Nov 3 09:30:09 PST 2015


Hm.  That's almost, but not quite, right.  Disabling dirdata during the 
fsck run has no positive effect - fsck will still get upset about the 
incorrectly placed entry.  (And whether or not dirdata is enabled, fsck 
will do the same thing.  It doesn't know or care about the dirdata 
setting as such.)

Steps #1 and #2 will not cause any problems until you run fsck, but 
there's no way around the issue once you do run fsck.  The .. dentry 
must go back to the correct location to make fsck happy.  If I remember 
right, fsck creates the .. dentry and doesn't include the FID 
(regardless of dirdata setting).  This can overwrite another dentry if 
one has been placed in the location normally reserved for the .. dentry 
(which can happen if the dentry that followed the .. dentry is 
deleted, leaving a gap large enough for a dentry+FID).
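
To put rough numbers on the space issue, here is a small sketch of the 
record-length arithmetic.  It assumes the standard ext4 dirent header 
(8 bytes, records padded to a multiple of 4) and assumes dirdata stores 
the FID after the name as a NUL byte, a one-byte length, and a 16-byte 
FID.  The helper names are made up for illustration and are not taken 
from ldiskfs or e2fsprogs.

```python
DIRENT_HEADER = 8      # inode (4) + rec_len (2) + name_len (1) + file_type (1)
FID_BYTES = 16         # Lustre FID: 64-bit sequence, 32-bit oid, 32-bit version
DIRDATA_OVERHEAD = 2   # ASSUMPTION: NUL after the name + one length byte


def rec_len(name_len, with_fid=False):
    """Minimum on-disk record length for a dentry, rounded up to 4 bytes."""
    size = DIRENT_HEADER + name_len
    if with_fid:
        size += DIRDATA_OVERHEAD + FID_BYTES
    return (size + 3) & ~3


def fits(slot_len, name_len, with_fid=False):
    """True if a dentry of this shape fits in a slot of slot_len bytes."""
    return rec_len(name_len, with_fid) <= slot_len


dotdot_plain = rec_len(2)                 # ".." as a 1.8-era directory has it
dotdot_fid = rec_len(2, with_fid=True)    # ".." once the FID is attached
print(dotdot_plain, dotdot_fid)           # 12 28

# The original ".." slot is too small for ".." plus a FID, so the entry
# has to be written somewhere else in the block.
print(fits(dotdot_plain, 2, with_fid=True))   # False

# If the dentry right after the old ".." slot is deleted, the two free
# regions together can be big enough for some other dentry+FID.
neighbour = rec_len(8)                    # hypothetical deleted 8-char name
gap = dotdot_plain + neighbour
print(gap, fits(gap, 1, with_fid=True))   # 28 True
```

Under those assumptions the plain .. dentry needs 12 bytes and the 
version with a FID needs 28, which is why it cannot be updated in place.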

Furthermore, if you have a non-HTree directory where the .. dentry is 
incorrectly placed (your steps 1 & 2), and you then add files until it 
shifts to become an HTree directory, THAT directory becomes corrupted in 
a more severe manner that will cause your MDT to remount read-only 
and/or LBUG.  (LU-2638 only fixes the .. dentry bug for HTree 
directories themselves.  It does not help with a corrupted directory 
that then becomes an HTree directory.)
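
For anyone who wants to see what "incorrectly placed" means at the 
block level, below is a rough sketch that walks a linear (non-HTree) 
directory block and reports where the .. dentry sits.  It assumes the 
usual little-endian ext4 dirent header and uses a toy 64-byte block 
built in memory.  The function names are hypothetical, and this is not 
how e2fsck itself is implemented.

```python
import struct

HEADER = "<IHBB"   # inode, rec_len, name_len, file_type (little-endian)


def pack_dirent(inode, name, rec_len, file_type=2):
    """Build one dirent padded with zeros out to rec_len bytes."""
    header = struct.pack(HEADER, inode, rec_len, len(name), file_type)
    return (header + name).ljust(rec_len, b"\0")


def walk(block):
    """Yield (offset, name) for each live dentry in a linear block."""
    off = 0
    while off < len(block):
        inode, rec_len, name_len, _ = struct.unpack_from(HEADER, block, off)
        if rec_len < 8:          # corrupt record, stop rather than loop
            break
        if inode != 0:           # inode 0 marks an unused record
            yield off, block[off + 8:off + 8 + name_len]
        off += rec_len


def dotdot_position(block):
    """Return (index, offset) of the '..' dentry, or None if missing."""
    for index, (off, name) in enumerate(walk(block)):
        if name == b"..":
            return index, off
    return None


# Expected layout: "." first, ".." second, then everything else.
good = pack_dirent(2, b".", 12) + pack_dirent(2, b"..", 12) \
    + pack_dirent(11, b"a", 40, file_type=1)

# Layout after ".." was relocated: another dentry now sits in slot two.
moved = pack_dirent(2, b".", 12) + pack_dirent(11, b"a", 12, file_type=1) \
    + pack_dirent(2, b"..", 40)

print(dotdot_position(good))    # (1, 12)
print(dotdot_position(moved))   # (2, 24)
```

In the second block the .. dentry is the third entry at offset 24 
rather than the second entry at offset 12.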

- Patrick

On 11/03/2015 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> On Oct 27, 2015, at 1:46 PM, Patrick Farrell <paf at cray.com> wrote:
>>
>> That's something of a time bomb - If one of those directories fsck wishes it could correct is small and grows in number of files, you'll get the MDT going read only (and a few odd LBUGs if you try to put it back).
> I was looking back over the incident where I thought I had hit this bug, but based on the lack of side effects that you mentioned, I am now starting to think that I was mistaken.  Nevertheless, I am trying to understand the bug a little better in case I am still susceptible to it.  I tried to summarize my understanding below, and maybe you can tell me if I am correct.
>
> For HTree directories, the problem is described in LU-2638.  But since I am running Lustre >2.4, I should not be affected by this bug.
>
> For non-HTree directories, the problem is described in LU-5626.  In order to trigger the bug, the following steps must happen:
>
> 1) A non-HTree directory created under Lustre 1.8 (which does not have a FID for its “..” entry) gets moved to a different parent directory.
>
> 2) Lustre tries to update the “..” entry in the directory, and if there is not enough space in the existing entry, it creates a new “..” entry and adds the FID.
>
> 3) Something happens to the MDT, and fsck needs to be run.  When it runs, it notices that “..” is no longer the second entry in the directory.
>
> 4) fsck tries to “fix” the problem by moving the “..” entry back to its original position.  With the FID in place, there is not enough space in the original position, but fsck moves it anyway which causes the “..” entry to overwrite part of the third entry in the directory.
>
> If that is correct, then steps #1 and #2 can happen without causing any problems.  It is only at steps #3 and #4 that the corruption occurs, and as long as dirdata is disabled before fsck is run, then there should not be any problems.
>
> Is that explanation accurate?
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
