[lustre-discuss] Removing stale files {External}

Thu Jun 9 07:46:55 PDT 2022

On Thu, Jun 09, 2022 at 04:29:54AM +0000, Andreas Dilger wrote:
>How did you drain the OST?  Was the OST totally deactivated, or "max_create_count=0"?
>If it was deactivated, then this will prevent OST objects from being destroyed when the
>MDT inode is deleted.

I set max_create_count to 0, then I did an lfs find on the OST and ran
an lfs migrate against each file.  Many of the files in the lfs find
output hung when accessed, so I had to use unlink to remove those.  It
took many cycles of find/migrate/unlink until the OST was empty at the
logical level.  Currently there are a little over 5 million files
visible in ldiskfs under O/, and destroys_in_flight on the MDS is 10
higher than the number of file objects under O/.
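
For anyone following along, the drain steps above can be sketched
roughly as below.  This only builds the command lines, it does not run
anything.  The names "testfs" and "/mnt/testfs" and the OST/MDT indexes
are placeholders, and the sketch assumes the lfs_migrate wrapper in the
pipeline form shown in the Lustre manual rather than the per-file
lfs migrate I actually ran.

```python
def drain_commands(fsname, ost_index, mdt_index, mountpoint):
    """Build (but do not run) the command lines for draining one OST."""
    # Lustre target names carry a 4-digit hex index, e.g. OST0003.
    ost = f"{fsname}-OST{ost_index:04x}"
    osp = f"{ost}-osc-MDT{mdt_index:04x}"
    return [
        # On the MDS: stop new object creation on this OST without
        # deactivating it, so object destroys can still be processed.
        f"lctl set_param osp.{osp}.max_create_count=0",
        # On a client: list files with objects on the OST and migrate them.
        # Assumption: lfs_migrate wrapper, fed from lfs find on stdin.
        f"lfs find --ost {ost_index} {mountpoint} | lfs_migrate -y",
    ]


if __name__ == "__main__":
    # Hypothetical filesystem name, indexes, and mount point.
    for cmd in drain_commands("testfs", 3, 0, "/mnt/testfs"):
        print(cmd)
```

Files that hang on access will not migrate this way, which is why the
unlink pass was needed in between.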

>This is normal.  The number is the object ID.  You can check the OST objects with
>ll_decode_filter_fid (when mounted as type ldiskfs) to report the parent MDT FID
>that the object belongs/belonged to.  Then "lfs fid2path" can be used to check if the
>file still exists and/or if the OST object is still part of the layout (which it should not be).

The one time I ran fid2path against a corrupt file the MDS crashed.  I'm
a little afraid to try that.  But ll_decode_filter_fid is an interesting
option.  I have two FIDs that pop out of syslogs as being in error, and
finding those files might be helpful.  One of the FIDs gives me two
object IDs that I can find, but when I look at the ldiskfs I find a
total of 160 files that appear to be that same file.  The other FID that
pops up in the logs I have no idea about, but if I can use
ll_decode_filter_fid to track down the object IDs that the FID is
referring to, then I can delete those as well.
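
A minimal sketch of how the ll_decode_filter_fid output could be turned
into a "parent FID -> objects" map, so that all of the objects claiming
a given FID can be listed at once.  It assumes each output line looks
like "<object path>: ... parent=[0xSEQ:0xOID:0xVER] ...", which may
differ between Lustre versions, and the sample lines and FIDs are made
up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shape: "<path>: [other fields] parent=[0x..:0x..:0x..] ..."
PARENT_RE = re.compile(
    r"^(?P<path>.+?): .*?"
    r"parent=(?P<fid>\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\])"
)


def group_objects_by_parent(lines):
    """Map each parent MDT FID to the OST object paths that reference it."""
    groups = defaultdict(list)
    for line in lines:
        match = PARENT_RE.search(line)
        if match:  # lines without a parent= field are skipped
            groups[match.group("fid").lower()].append(match.group("path"))
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical output lines, not taken from a real system.
    sample = [
        "O/0/d2/34: parent=[0x200000400:0x1:0x0] stripe=0",
        "O/0/d3/35: objid=35 seq=0 parent=[0x200000400:0x1:0x0] stripe=1",
        "O/0/d4/36: parent=[0x200000400:0x2:0x0] stripe=0",
    ]
    for fid, paths in group_objects_by_parent(sample).items():
        print(fid, len(paths), paths)
```

The input would be the saved output of ll_decode_filter_fid run over
the object files under O/ with the OST mounted as type ldiskfs.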

After reading your email, we are going to try finding the object for the
FID that doesn't report its objects and removing that, as well as the
two objects that I know are duplicates, and see what happens.  We are
hoping that if we can resolve the first errors the kernel is producing,
the rest of the filesystem will self-heal like it is supposed to.

Your email gave us a lot of confidence in what we are trying to do to
fix this, so thank you for the reply.  If we can ever lure you back to
the DSOC/NRAO to give another talk I'll thank you in person!  The
restaurant we took you to last time has closed, but we have some other
options in town.

--Schlake
  Sysadmin IV, NRAO
  Work: 575-835-7281 (BACK IN THE OFFICE!)
  Cell: 575-517-5668 (out of work hours)