[Lustre-discuss] Lustre client question

Fri May 13 10:31:58 PDT 2011

We recently had two RAID rebuilds on a couple of storage targets that did
not go according to plan. The cards reported a successful rebuild in
each case, but ldiskfs errors started showing up on the associated OSSs
and the affected OSTs were remounted read-only. We are planning to
migrate the data off, but we've noticed that some clients are getting I/O
errors, while others are not. As an example, a file that has a stripe on
at least one affected OST could not be read on one client (I received a
read error trying to access it), while it was perfectly readable and
apparently uncorrupted on another (I am able to migrate the file to
healthy OSTs by copying it to a new file name). The clients
with the I/O problem see inactive devices corresponding to the read-only
OSTs when I issue an 'lfs df', while the others without the I/O problems
report the targets as normal. Is it just that many clients are not aware
of an OST problem yet? I need clients with minimal I/O disruptions in
order to migrate as much data off as possible.
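
For reference, the copy-to-a-new-name migration described above can be sketched roughly as follows. This is only a minimal illustration, not the exact procedure used here: the function name and the ".migrated" suffix are hypothetical, and it assumes new files are not being allocated on the affected OSTs (for example because those OSTs have been deactivated for new object creation).

```python
import filecmp
import os
import shutil


def copy_to_new_name(src, suffix=".migrated"):
    """Copy src to a new file name in the same directory; return the new path.

    The suffix is a hypothetical naming choice. The original file is left
    untouched, so nothing is lost if the read fails partway through.
    """
    dst = src + suffix
    if os.path.exists(dst):
        raise FileExistsError(dst)
    try:
        shutil.copyfile(src, dst)  # raises OSError on a read (I/O) error
    except OSError:
        # Remove the partial copy so a retry on another client starts clean.
        if os.path.exists(dst):
            os.remove(dst)
        raise
    # Byte-for-byte comparison; this re-reads both files.
    if not filecmp.cmp(src, dst, shallow=False):
        os.remove(dst)
        raise OSError("copy of %s does not match the original" % src)
    return dst
```

Run on a client that can still read the file, this leaves the original in place; renaming the copy over the original would be a separate, deliberate step once the copy has been checked.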

A client reboot appears to awaken them to the fact that there are problems 
with the OSTs. However, I need them to be able to read the data in order 
to migrate it off. Is there a way to reconnect the clients to the 
problematic OSTs?
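
To tell which state a given client is in without rebooting it, the 'lfs df' output can be scanned for inactive targets. The sketch below is an assumption-laden illustration: it assumes inactive targets appear on lines containing the text "inactive device" with the target name as the first field, and both function names are hypothetical.

```python
import subprocess


def inactive_targets(lfs_df_output):
    """Return the target names on lines that 'lfs df' marks as inactive.

    Assumption: an inactive target is reported on a line containing the
    text "inactive device", with the target name as the first field.
    """
    names = []
    for line in lfs_df_output.splitlines():
        if "inactive device" in line:
            fields = line.split()
            # Drop a trailing colon if it is attached to the name.
            names.append(fields[0].rstrip(":"))
    return names


def client_view():
    """Run 'lfs df' on this client and list the targets it sees as inactive."""
    result = subprocess.run(
        ["lfs", "df"], capture_output=True, text=True, check=True
    )
    return inactive_targets(result.stdout)
```

An empty list from client_view() would suggest the client still treats every target as normal, which matches the clients that can currently read the affected files.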

We have dd-ed copies of the OSTs to try e2fsck against them, but the 
results were not promising. The check aborted with:

------
Resize inode (re)creation failed: A block group is missing an inode 
table.Continue? yes

ext2fs_read_inode: A block group is missing an inode table while reading 
inode 7 in recreate inode
e2fsck: aborted
------

Any advice would be greatly appreciated.
Zach