[lustre-discuss] dealing with maybe dead OST
adilger at whamcloud.com
Wed Jun 20 10:39:33 PDT 2018
On Jun 19, 2018, at 09:33, Robin Humble <rjh+lustre at cita.utoronto.ca> wrote:
> so we've maybe lost 1 OST out of a filesystem with 115 OSTs. we may
> still be able to get the OST back, but it's been a month now so
> there's pressure to get the cluster back and working and leave the
> files missing for now...
> the complication is that because the OST might come back to life we
> would like to avoid the users rm'ing their broken files and potentially
> deleting them forever.
> lustre is 2.5.41 ldiskfs centos6.x x86_64.
> ideally I think we'd move all the ~2M files on the OST to a root access
> only "shadow" directory tree in lustre that's populated purely with
> files from the dead OST.
> if we manage to revive the OST then these can magically come back to
> life and we can mv them back into their original locations.
> but currently
> mv: cannot stat 'some_file': Cannot send after transport endpoint shutdown
> the OST is deactivated on the client. the client hangs if the OST isn't
> deactivated. the OST is still UP & activated on the MDS.
> is there a way to mv files when their OST is unreachable?
> seems like mv is an MDT operation so it should be possible somehow?
This is a problem purely of GNU fileutils' invention. It is very stat()
happy and will stat() a file and its parent directory several times during
mv, cp, rm, etc. "just to make sure" rather than going ahead and just
trying the operation. You can see this by running "strace mv <src> <tgt>",
especially if they are in different directories.
I don't think there is a low-level "rename" tool that is like "unlink"
that will just do the rename() call without all of the overhead. The
"rename" command is (AFAICS) meant to rename a batch of files with some
common substring in the filename (like "rename foo bar foo*.txt").
In the Lustre source tree there is a very simple C program that only calls
rename() without doing stat() or anything else. This is lustre/tests/mrename.c
that you could use together with "lfs find", something like:
mkdir -p .broken_ost0012
lfs find . -type f --ost myfs-OST0012 |
while IFS= read -r F; do
    mkdir -p ".broken_ost0012/$(dirname "$F")"
    mrename "$F" ".broken_ost0012/$F"
done
(this is completely untested, but something similar should work).
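A slightly more defensive variant of the same loop could be wrapped in a function, sketched below. The function name shadow_move and the RENAME_CMD variable are hypothetical, introduced only for illustration; RENAME_CMD is assumed to point at an mrename binary built from lustre/tests. The sketch skips any path already under the shadow directory, since "lfs find ." may also walk into it, and it reports failed renames instead of stopping. Like the loop above, it has not been tried against a real filesystem with a dead OST.

```shell
# Hypothetical helper: read file paths on stdin (one per line) and rename
# each one into the same relative location under a shadow directory.
# RENAME_CMD is an assumption: it should name a binary that only calls
# rename(), such as mrename built from lustre/tests.
shadow_move() {
    shadow_root=$1
    rename_cmd=${RENAME_CMD:-mrename}
    while IFS= read -r f; do
        f=${f#./}                        # drop the leading "./" from find output
        case "$f" in
            "$shadow_root"/*) continue ;;   # already inside the shadow tree
        esac
        # Recreate the parent directory inside the shadow tree (MDT-only operation).
        mkdir -p "$shadow_root/$(dirname "$f")" || return 1
        # Rename without stat(); log failures and carry on with the next file.
        "$rename_cmd" "$f" "$shadow_root/$f" || echo "rename failed: $f" >&2
    done
}
```

It would be used from the top of the filesystem in the same way as the loop, for example: lfs find . -type f --ost myfs-OST0012 | RENAME_CMD=./mrename shadow_move .broken_ost0012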
> the only thing I've thought of seems pretty out there...
> mount the MDT as ldiskfs and mv the affected files into the shadow
> tree at the ldiskfs level.
> ie. with lustre running and mounted, create an empty shadow tree of
> all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on
> the MDT:
> for f in <list_of_2m_files>; do
> mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
> done
> would that work?
This would work to some degree, but the "link" xattr on each file
would not be updated, so "lfs fid2path" would be broken until a
full LFSCK is run.
> maybe we'd also have to rebuild OI's and lfsck - something along the
> lines of the MDT restore procedure in the manual. hopefully that would
> all work with an OST deactivated.
> alternatively, should we just unlink all the currently dead files from
> lustre now, and then if the OST comes back can we reconstruct the paths
> and filenames from the FID in xattrs's on the revived OST?
> I suspect unlink is final though and this wouldn't work... ?
That would be possible, but overly complex, since the inodes would be
removed from the MDT and you'd need to reconstruct them with LFSCK and
find the names, as LFSCK would dump them all into $MNT/.lustre/lost+found.
> we can also take an lvm snapshot of the MDT and refer to that later I
> suppose, but I'm not sure how that might help us.
It should be possible to copy the unlinked files from the backup MDT
to the current MDT (via ldiskfs), along with an LFSCK run to rebuild
the OI files. It is always a good idea to have an MDT device-level
backup before you do anything drastic like this. However, for the
meantime I think that renaming the broken files to a root-only directory
is the safest.
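To make the shadow directory root-only before moving anything into it, something like the following sketch should be enough. The function name make_private_dir is hypothetical, and it assumes the command is run as root on a client, so that the directory ends up owned by root with no group or other access.

```shell
# Hypothetical helper: create a directory that only its owner can enter.
# Assumes it is run as root, so the owner of the new directory is root.
make_private_dir() {
    dir=$1
    # mode 0700: read/write/search for the owner, nothing for anyone else
    mkdir -p "$dir" && chmod 0700 "$dir"
}
```

For the example above this would be run once as: make_private_dir .broken_ost0012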
Principal Lustre Architect