[Lustre-discuss] How do I recover files from partial lustre disk?
megan
dobsonunit at gmail.com
Wed Jun 18 14:33:51 PDT 2008
Thank you Andreas!
Your information is wonderful. I did the following:
I logged into my MDS (same as MGS) and issued the commands--
shell-prompt> mount -t lustre /dev/md1 /srv/lustre/mds/crew4-MDT0000
No errors so far.
shell-prompt> lctl
dl (Found my nids of failed JBODs)
device 14
deactivate
device 16
deactivate
quit
On one of our servers, I mounted the Lustre filesystem at /crew4.
A UNIX df or ls command on that mount point still hangs.
However....
shell-prompt> lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID --ost crew4-OST0004_UUID -print /crew4
Did indeed provide a list of files. I saved the list to a text
file. I will next see if I am able to copy a single file to a new
location.
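For the copy step, a minimal sketch of how the saved list could be replayed, assuming one path per line and no newlines inside file names. The function name rescue_copy and the file and directory names in the usage comment are hypothetical, not anything from this thread. A listed file may still have a stripe on a failed OST, in which case cp should fail with an I/O error; such paths are written to a separate log instead of stopping the run.

```shell
# rescue_copy LIST DEST FAILED_LOG   (hypothetical helper, not a Lustre tool)
# Copies every regular file named in LIST (one path per line) into DEST,
# recreating the original directory layout underneath it. Paths that
# cannot be copied are appended to FAILED_LOG and the loop carries on.
rescue_copy() {
    list=$1
    dest=$2
    failed=$3
    : > "$failed"                        # start with an empty failure log
    while IFS= read -r path; do
        [ -d "$path" ] && continue       # directories are created on demand
        target="$dest/${path#/}"         # /crew4/a/b -> DEST/crew4/a/b
        if ! mkdir -p "$(dirname "$target")"; then
            printf '%s\n' "$path" >> "$failed"
            continue
        fi
        # </dev/null keeps cp from ever reading the list on stdin
        if ! cp -p "$path" "$target" </dev/null; then
            printf '%s\n' "$path" >> "$failed"
        fi
    done < "$list"
}

# Example call (all three names are placeholders):
#   rescue_copy crew4-good-files.txt /mnt/rescue crew4-failed.txt
```

Trying this on a list containing a single file first, as planned above, would show whether reads from the good OSTs complete before committing to the whole list.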
Thank you again Andreas for this incredibly useful information. Do
you/Sun do paid Lustre consulting by any chance?
Later,
megan
On Jun 18, 12:48 am, Andreas Dilger <adil... at sun.com> wrote:
> On Jun 16, 2008 15:37 -0700, megan wrote:
>
> > I am using Lustre with the 2.6.18-53.1.13.el5_lustre.1.6.4.3smp kernel
> > on a CentOS 5 x86_64 Linux box.
> > We had a hardware problem that caused the underlying ext3 partition
> > table to completely blow up. This is resulting in only three of five
> > OSTs being mountable. The main lustre disk of this unit cannot be
> > mounted because the MDS knows that two of its parts are missing.
>
> It should be possible to mount a Lustre filesystem with OSTs that
> are not available. However, access to files on the unavailable
> OSTs will cause the process to wait on OST recovery.
>
> > The underlying set-up is JBOD hw that is passed to the linux OS, via
> > an LSI 8888ELP card in this case, as a simple device, ie. sde,
> > sdf,... The simple devices were partitioned using parted and
> > formatted ext3 then lustre was built on top of the five ext3 units.
> > There was no striping done across units/JBODS. Three of the five
> > units passed an e2fsck and an lfsck. Those remaining units are
> > mounted as such:
> > /dev/sdc 13T 6.3T 5.7T 53% /srv/lustre/OST/crew4-OST0003
> > /dev/sdd 13T 6.3T 5.7T 53% /srv/lustre/OST/crew4-OST0004
> > /dev/sdf 13T 6.2T 5.8T 52% /srv/lustre/OST/crew4-OST0001
>
> > Being that it is unlikely that we shall be able to recover the
> > underlying ext3 on the other two units, is there some method by which
> > I might try to rescue the data from these last three units mounted
> > currently on the OSS?
>
> > Any and all suggestions are genuinely appreciated.
>
> The recoverability of your data depends heavily on the striping of
> the individual files (i.e. the default striping). If your files have
> a default stripe_count = 1, then you can probably recover 3/5 of the
> files in the filesystem. If your default stripe_count = 2, then you
> can probably only recover 1/5 of the files, and if you have a higher
> stripe_count you probably can't recover any files.
>
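The fractions quoted above can be reproduced with a small counting sketch. It rests on assumptions that are not stated in this thread: that a file with stripe_count = N occupies N consecutive OST indices (wrapping around after the last one), that every starting OST is equally likely, and that the failed OSTs are indices 0 and 2 (OST0000 and OST0002). The function name recoverable_fraction is hypothetical.

```shell
# recoverable_fraction OST_COUNT STRIPE_COUNT FAILED_INDEX...
# Prints "good/total": the number of starting OSTs for which all of a
# file's stripes avoid every failed OST, over the number of starting OSTs.
# Assumes stripes land on consecutive OST indices, wrapping around.
recoverable_fraction() {
    osts=$1
    stripes=$2
    shift 2                              # remaining arguments: failed indices
    good=0
    start=0
    while [ "$start" -lt "$osts" ]; do
        ok=1
        i=0
        while [ "$i" -lt "$stripes" ]; do
            idx=$(( (start + i) % osts ))
            for bad in "$@"; do
                if [ "$idx" -eq "$bad" ]; then
                    ok=0                 # one stripe on a failed OST loses the file
                fi
            done
            i=$((i + 1))
        done
        good=$((good + ok))
        start=$((start + 1))
    done
    echo "$good/$osts"
}

# Five OSTs, with OST0000 and OST0002 failed:
#   recoverable_fraction 5 1 0 2    prints 3/5
#   recoverable_fraction 5 2 0 2    prints 1/5
#   recoverable_fraction 5 3 0 2    prints 0/5
```

Under those assumptions, the only two-stripe placement that avoids both failed OSTs is the pair OST0003 and OST0004, which is where the 1/5 figure comes from.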
> What you need to do is to mount one of the clients and mark the
> corresponding OSTs inactive with:
>
> lctl dl # get device numbers for OSC 0000 and OSC 0002
> lctl --device N deactivate
>
> Then, instead of waiting for the OSTs to recover, the client will
> get an I/O error when it accesses files on the failed OSTs.
>
> To get a list of the files that are on the good OSTs run:
>
> lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID --ost crew4-OST0004_UUID {mountpoint}
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-disc... at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss