[Lustre-discuss] How do I recover files from partial lustre disk?
Andreas Dilger
adilger at sun.com
Tue Jun 17 21:48:29 PDT 2008
On Jun 16, 2008 15:37 -0700, megan wrote:
> I am using the Lustre 2.6.18-53.1.13.el5_lustre.1.6.4.3smp kernel on a
> CentOS 5 x86_64 Linux box.
> We had a hardware problem that destroyed the underlying ext3 partition
> tables. As a result, only three of the five OSTs are mountable. The
> main Lustre filesystem cannot be mounted because the MDS knows that
> two of its parts are missing.
It should be possible to mount a Lustre filesystem with OSTs that
are not available. However, access to files on the unavailable
OSTs will cause the process to wait on OST recovery.
> The underlying set-up is JBOD hardware that is passed to the Linux OS
> (via an LSI 8888ELP card in this case) as simple devices, i.e. sde,
> sdf, ...  The simple devices were partitioned using parted and
> formatted ext3, then Lustre was built on top of the five ext3 units.
> There was no striping done across units/JBODs. Three of the five
> units passed an e2fsck and an lfsck. Those remaining units are
> mounted as follows:
> /dev/sdc   13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0003
> /dev/sdd   13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0004
> /dev/sdf   13T  6.2T  5.8T  52%  /srv/lustre/OST/crew4-OST0001
>
> Given that it is unlikely we shall be able to recover the underlying
> ext3 on the other two units, is there some method by which I might
> rescue the data from the three units currently mounted on the OSS?
>
> Any and all suggestion genuinely appreciated.
The recoverability of your data depends heavily on the striping of
the individual files (i.e. the default striping). If your files have
a default stripe_count = 1, then you can probably recover 3/5 of the
files in the filesystem. If your default stripe_count = 2, then you
can probably only recover 1/5 of the files, and if you have a higher
stripe_count you probably can't recover any files.
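You can check which case applies before starting; `lfs getstripe` reports
both the default layout and each file's actual layout. A rough sketch
(the mountpoint /mnt/crew4 is only an example):

    # Show the filesystem's default stripe layout (count, size, offset):
    lfs getstripe -d /mnt/crew4

    # Show an individual file's actual layout, including which OST
    # objects back it (paths here are examples):
    lfs getstripe /mnt/crew4/some/file

A file whose objects all live on the surviving OSTs is fully
recoverable regardless of the filesystem-wide default.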
What you need to do is mount one of the clients and mark the
corresponding OSCs inactive with:

    lctl dl                      # get device numbers for OSC 0000 and OSC 0002
    lctl --device N deactivate
Then, instead of the clients waiting for the OSTs to recover the
client will get an IO error when it accesses files on the failed OSTs.
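Put together, the sequence on a client looks roughly like this (the
device numbers 5 and 7 are made-up examples; take the real ones from
your own `lctl dl` output):

    # List configured devices and note the device numbers of the OSCs
    # for the two lost OSTs (crew4-OST0000 and crew4-OST0002):
    lctl dl | grep osc

    # Deactivate each of them (5 and 7 are hypothetical numbers):
    lctl --device 5 deactivate
    lctl --device 7 deactivate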
To get a list of the files that are on the good OSTs run:
    lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
             --ost crew4-OST0004_UUID {mountpoint}
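Once that list is generated, the surviving files can be copied off the
filesystem. A rough sketch, assuming (as examples) the client mount is
/mnt/crew4 and /srv/rescue is the destination:

    # With stripe_count > 1, a listed file may still have objects on a
    # lost OST, so reads can fail with EIO; log the failures for review.
    lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
             --ost crew4-OST0004_UUID /mnt/crew4 |
    while read -r f; do
        # cp --parents recreates the directory tree under /srv/rescue
        cp --parents "$f" /srv/rescue/ 2>>/srv/rescue/failed.log
    done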
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.