[Lustre-discuss] Failed OST Cleanup
Bernd Schubert
bs_lists at aakef.fastmail.fm
Wed Jun 2 14:12:45 PDT 2010
On Wednesday 02 June 2010, Andreas Dilger wrote:
> On 2010-06-02, at 11:54, Scott Barber wrote:
> > I'm now trying to get a list of files that are now corrupt. On one of
> > the lustre clients I'm running:
> > lfs find --obd sanvol06-OST0013_UUID <my lustre mount point>
> >
> > It starts to list files and then a few minutes later it runs into an
> > error and stops:
> > cb_find_init: IOC_LOV_GETINFO on <filename> failed: Input/output error.
> >
> > In dmesg I see:
> > LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue
> > returned rc -5, returning -EIO
> >
> > The file that gets that "Input/output error" cannot be deleted or
> > removed from the file system. How can I get around this?
>
> There is a bug in "lfs find": it tries to get the file size
> unnecessarily. You can use "lfs getstripe --obd ..." instead, and it
> should work even if the OST is down.
Hmm, yes and no. In principle I like the idea that 'lfs find' tries to figure
out the file size. A couple of years ago I had to deal with a 3-disk failure
in a RAID6 array, and although we tried to clone the third failing disk, in
the end we lost that OST. The file system was configured with a stripe size
of 4M and a stripe count of 4. When I then ran 'lfs find' to find files
located on that OST, it reported lots of files that *would* have had data on
that OST if they had been sufficiently large. But many of those files were
smaller than 1M, so it would have been wrong to delete them. It turned out
that 'lfs find' was rather useless for us, and I simply had to read each
file: if the read succeeded, all was fine; if it failed, I moved the file
into a dedicated subdirectory. The missing OST was later recreated (that was
easier back then with 1.4 than it is nowadays), and we only lost a small part
of the files, definitely much less than what 'lfs find' suggested.
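
The read-and-move procedure described above can be sketched in Python roughly
as follows. This is only an illustration, not the script that was actually
used: the function names (is_readable, quarantine_unreadable) and the
quarantine directory are made up for the example, and it assumes the
quarantine directory is on the same file system so that a plain rename works.

```python
import os


def is_readable(path, chunk_size=4 * 1024 * 1024):
    """Return True if the whole file can be read without an I/O error."""
    try:
        with open(path, "rb") as f:
            # Read to EOF in chunks; an unreachable OST object should
            # surface here as an OSError (EIO).
            while f.read(chunk_size):
                pass
        return True
    except OSError:
        return False


def quarantine_unreadable(root, quarantine_dir, check=is_readable):
    """Move every file under root that fails `check` into quarantine_dir.

    Returns a list of (original_path, new_path) pairs.
    """
    root = os.path.abspath(root)
    quarantine_dir = os.path.abspath(quarantine_dir)
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the quarantine directory itself.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) != quarantine_dir]
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Skip symlinks; everything else gets a full read attempt.
            if os.path.islink(src) or check(src):
                continue
            dst = os.path.join(quarantine_dir, os.path.relpath(src, root))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            os.rename(src, dst)  # assumes same file system
            moved.append((src, dst))
    return moved
```

Something like quarantine_unreadable("/mnt/lustre/project",
"/mnt/lustre/project/.damaged") would then leave the readable files in place
and collect the rest under one directory (both paths are placeholders).
Reading every file is slow on a large file system, but it tests what actually
matters: whether the data can still be retrieved.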
So if 'lfs find' now used the file size to determine whether a file really
has data on an OST, that would be an improvement. Of course, if it fails
outright with an I/O error, that is not useful either ;)
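
To illustrate the size check, here is a small Python sketch of the arithmetic
for a plain round-robin (RAID0-style) striped layout. It is not how 'lfs find'
is implemented; the function names are hypothetical, the list of OST indices
is assumed to come from something like 'lfs getstripe', and the file size has
to be known from somewhere, which is exactly the hard part when an OST is
down. Sparse files are ignored.

```python
def stripes_with_data(file_size, stripe_size, stripe_count):
    """Return the 0-based stripe indices that hold data for this file size."""
    if file_size <= 0:
        return []
    # Number of stripe_size chunks needed to cover the file (ceiling division).
    chunks = -(-file_size // stripe_size)
    # Chunks are laid out round-robin, so at most stripe_count stripes are used.
    return list(range(min(chunks, stripe_count)))


def may_have_data_on_ost(file_size, stripe_size, ost_indices, target_ost):
    """True if a file of this size has data on target_ost.

    ost_indices lists the OST index backing each stripe of the file,
    in stripe order (assumed input, e.g. taken from the file's layout).
    """
    used = stripes_with_data(file_size, stripe_size, len(ost_indices))
    return any(ost_indices[i] == target_ost for i in used)


MiB = 1024 * 1024
# A 512 KiB file with 4M stripes over 4 OSTs only touches its first stripe.
print(stripes_with_data(512 * 1024, 4 * MiB, 4))                        # [0]
print(may_have_data_on_ost(512 * 1024, 4 * MiB, [17, 18, 19, 20], 19))  # False
print(may_have_data_on_ost(9 * MiB, 4 * MiB, [17, 18, 19, 20], 19))     # True
```

With a 4M stripe size, any file of 4M or less lives entirely on its first
stripe, so a lost OST that only backs stripes 1 to 3 of such a file costs
nothing.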
Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks