I had the joy of taking this one apart personally.  We mostly let lfsck do the repair and moved on, accepting that some of the dentries were trashed.  I think, for important things, our field staff did some manual recovery with the e2fsprogs tools, but it was not a common enough problem that we documented a procedure.

If you read LU-5626 carefully, there's an explanation of the exact nature of the damage, and having that should let you make partial recoveries by hand.  I'm not familiar with the ll_recover_lost_found_objs tool, but I doubt it would prove helpful in this instance.

Note that there are two forms of this corruption.  The first: if you move a directory that was created before dirdata was enabled, the '..' entry ends up in the wrong place.  This does not trouble Lustre, but fsck reports it as an error and will 'correct' it, which (usually) has the effect of overwriting one dentry in the directory when it creates a new '..' dentry in the correct location.
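
To make the first form concrete, here is a minimal Python sketch of the kind of check fsck applies: in the first block of a directory, '.' should be the first record and '..' the second.  It assumes the standard ext4 directory entry layout (little-endian inode, rec_len, name_len, file_type, then the name) and a single 4 KiB block already read into memory.  The names walk_dirents, dotdot_in_place and make_block are made up for illustration, the code ignores anything stored in a record after the name, and it is not how e2fsck or lfsck is actually implemented.

```python
import struct

HEADER = struct.Struct("<IHBB")  # inode, rec_len, name_len, file_type


def walk_dirents(block: bytes):
    """Yield (offset, inode, name) for every record in one directory block.

    Assumes the plain ext4 entry layout and a block small enough that
    rec_len needs no special decoding (e.g. 4 KiB).
    """
    offset = 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len, _ftype = HEADER.unpack_from(block, offset)
        # Reject records that would loop forever or run off the block.
        if rec_len < HEADER.size or offset + rec_len > len(block):
            raise ValueError(f"bad rec_len {rec_len} at offset {offset}")
        if HEADER.size + name_len > rec_len:
            raise ValueError(f"name overruns record at offset {offset}")
        name = block[offset + HEADER.size : offset + HEADER.size + name_len]
        yield offset, inode, name
        offset += rec_len


def dotdot_in_place(block: bytes) -> bool:
    """True if the first two records are '.' and '..', in that order."""
    names = [name for _, _, name in walk_dirents(block)]
    return names[:2] == [b".", b".."]


def make_block(entries, size=4096):
    """Hypothetical helper: pack (inode, name) pairs into a synthetic block.

    Every record is marked as a directory (file_type 2) for simplicity,
    and the last record is stretched to cover the rest of the block.
    """
    out = bytearray()
    for i, (inode, name) in enumerate(entries):
        rec_len = (HEADER.size + len(name) + 3) & ~3  # round up to 4 bytes
        if i == len(entries) - 1:
            rec_len = size - len(out)
        out += HEADER.pack(inode, rec_len, len(name), 2) + name
        out += b"\0" * (rec_len - HEADER.size - len(name))
    return bytes(out)


good = make_block([(12, b"."), (2, b".."), (13, b"file_a")])
bad = make_block([(12, b"."), (13, b"file_a"), (2, b"..")])
```

Here dotdot_in_place(good) is True and dotdot_in_place(bad) is False; the second layout, with '..' sitting behind another entry, is the condition fsck reports and then 'corrects'.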

I don't *think* that one causes the MDT to go read-only, but I could be wrong.  I *think* what causes the MDT to go read-only is the other problem:

When you have a non-htree directory (not too many items in it, all directory entries in a single block) that is in the bad state described above (with the '..' dentry in the wrong place after being moved) and that directory has enough files added to it that it becomes an htree directory, the resulting directory is corrupted more severely.  We never sorted out the precise details of this - I believe we chose to simply delete any directories in this state.  (I think lfsck did it for us, but can't recall for sure.)

I'd advise reading LU-5626 with care, and I'd also suggest you turn off 'dirdata' on your MDT until you have this under control.  That will at least prevent any more directories from ending up in either of these bad states if you keep using the filesystem before updating to a Lustre version that includes the LU-5626 patch.

We have a Lustre 1.8 filesystem that was upgraded to Lustre 2.x and the
"dirdata" feature was enabled. We encountered the LU-5626/LU-2638 issue with
".." directory entries. Are there established recovery steps for this
issue?

If I run fsck, the directory entries will be moved into lost+found.
I assume the next step is to run the ll_recover_lost_found_objs tool?

Can you share any advice/experience about recovery?

