[lustre-discuss] Removing stale files

William D. Colburn wcolburn at nrao.edu
Tue May 31 12:01:57 PDT 2022


We had a filesystem corruption back in February, and we've been trying
to salvage things since then.  I've spent the past month slowly draining
the corrupt OST, and over the weekend it finally finished.  An lfs find
on the filesystem says that there are no files stored on that OST.  The
OST is 100% full, and if I mount it as ldiskfs I can see a little
over five million files in O/*/*.  Most of them have numbers as names,
and some of them are named LAST_ID.  All of the numbered files seem to
be user data, with owners and real data in them (based on ls and the
find command).

I would like to clean out this OST and re-add it to Lustre, but I'm
unsure how best to approach this.  I see several options:

OPTION ONE: run lfsck against the entire filesystem with the full and
previously corrupt OST mounted.

OPTION TWO: run lfsck against only the corrupt OST in the hope that it
cleans up all of the orphans on that OST.

OPTION THREE: with the OST mounted as ldiskfs, remove
O/*/[1234567890]*[1234567890] and then remount the file system.

OPTION FOUR: newfs the bad OST and re-add it, losing the old index.

We tried option one once before, and it killed cluster jobs because it
made files unreadable while they were in use.  Option two might avoid
that since it would not be affecting existing files.  Option three
sounds like it will work based on my limited knowledge of how Lustre
works, and would probably be the most expedient method.  Option four is
annoying because it leaves a hole in the OST index sequence that is
upsetting to our OCD tendencies.
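
For option three, it may be worth dry-running the glob on a throwaway
directory before pointing it at the real OST.  The sketch below builds a
small mock tree (the O/0 directory and the object names are made up for
illustration and are not taken from the corrupt OST), lists what
O/*/[1234567890]*[1234567890] would match, and compares that with a
filter that accepts any all-digit name.  The pattern needs at least two
characters, so a one-digit name is skipped, and it would also match a
name that merely starts and ends with a digit.

```shell
#!/bin/sh
# Dry run on a throwaway mock tree; nothing here touches a real OST.
# The layout (O/<dir>/<name>) follows the O/*/* description above and is
# an assumption for illustration, not a dump of a real ldiskfs OST.
mock=$(mktemp -d)
mkdir -p "$mock/O/0"
touch "$mock/O/0/LAST_ID" "$mock/O/0/7" "$mock/O/0/1234" "$mock/O/0/98765"

# The pattern from option three: a digit, anything, then a digit.
matched=$(cd "$mock" && ls -d O/*/[1234567890]*[1234567890] | sort)
echo "glob would remove:"
echo "$matched"

# Alternative filter: every regular file whose name is all digits,
# including one-character names that the glob above does not match.
all_numeric=$(cd "$mock" && find O -type f | grep -E '/[0-9]+$' | sort)
echo "all-digit names:"
echo "$all_numeric"

# Clean up the mock tree.
rm -rf "$mock"
```

Listing with ls or find first, and only swapping in rm once the list
looks right, keeps LAST_ID and anything unexpected out of the blast
radius.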

Any and all advice is appreciated here.  Thank you.

--Schlake
  Sysadmin IV, NRAO
  Work: 575-835-7281 (BACK IN THE OFFICE!)
  Cell: 575-517-5668 (out of work hours)

