[Lustre-discuss] Lustre client question
Zachary Beebleson
zbeeble at math.uchicago.edu
Fri May 13 13:55:24 PDT 2011
Yes, the clients appear to have recovered. I didn't want to risk an fsck
until a new file-level backup was completed --- this will take time given
the size of our system.
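For what it's worth, a forced read-only pass should be safe to run even
before the backup finishes, since it answers "no" to every repair prompt and
never writes to the device. A minimal sketch, assuming the Lustre-patched
e2fsprogs and with /dev/sdX standing in for the actual OST device (or a dd
image of it):

------
# Read-only assessment of the unmounted OST before attempting any repair:
#   -f  force a full check even if the filesystem is marked clean
#   -n  open the filesystem read-only and answer "no" to every prompt
# /dev/sdX is a placeholder for the real OST device or a dd copy of it.
e2fsck -f -n /dev/sdX
------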
I've done at least 5 or 6 raid rebuilds in the past without issue using
these raid cards. We will try to isolate the cause of this problem
further --- perhaps a bad batch of spare drives, a buggy raid driver
(I think this is a newer Lustre version), etc.
Many thanks for your help.
Zach.
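PS. For the migration itself, the plan is essentially to script the
copy-to-a-new-name approach mentioned below, after deactivating the bad OST
on the MDS so that no new objects are allocated on it. A rough sketch ---
the OSC device name, OST UUID, and mount point are placeholders for ours:

------
# On the MDS: stop new object allocations on the failing OST.
# "lustre-OST0002-osc" is a placeholder; the real name/index is in 'lctl dl'.
lctl --device lustre-OST0002-osc deactivate

# On a client that can still read the OST: find every file with a stripe
# on it, copy it off, and rename over the original so the new copy lands
# on the healthy OSTs (newer Lustre ships an lfs_migrate script that
# automates roughly this).
lfs find --obd lustre-OST0002_UUID /mnt/lustre | while read -r f; do
    cp -p "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
------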
> It sounds like it is working better. Did the clients recover? I would have
> re-run fsck before mounting it again, and moving the data off may still be
> the best plan. Since dropping the rebuilt drive reduced the corruption, you
> should certainly contact your raid vendor about this issue.
>
> Kevin
>
>
> Zachary Beebleson wrote:
>> Kevin,
>>
>> I just failed the drive and remounted. A basic 'df' hangs when it gets to
>> the mount point, but /proc/fs/lustre/health_check reports everything is
>> healthy. 'lfs df' on a client reports the OST as active, where it was
>> inactive before. However, I'm now working with a degraded volume, though
>> it is raid 6. Should I try another rebuild or just proceed with the
>> migration off of this OST asap?
>>
>> Thanks,
>> Zach
>>
>> PS. Sorry for the repeat message
>> On Fri, 13 May 2011, Kevin Van Maren wrote:
>>
>> > See bug 24264 -- certainly possible that the raid controller corrupted
>> > your filesystem.
>> >
>> > If you remove the new drive and reboot, does the file system look
>> > cleaner?
>> >
>> > Kevin
>> >
>> >
>> > On May 13, 2011, at 11:39 AM, Zachary Beebleson
>> > <zbeeble at math.uchicago.edu> wrote:
>> >
>> > >
>> > > We recently had two raid rebuilds on a couple of storage targets that
>> > > did not go according to plan. The cards reported a successful rebuild
>> > > in each case, but ldiskfs errors started showing up on the associated
>> > > OSSs and the affected OSTs were remounted read-only. We are planning
>> > > to migrate the data off, but we've noticed that some clients are
>> > > getting i/o errors while others are not. As an example, a file with a
>> > > stripe on at least one affected OST could not be read on one client,
>> > > i.e. I received a read error trying to access it, while it was
>> > > perfectly readable and apparently uncorrupted on another (I am able
>> > > to migrate the file to healthy OSTs by copying it to a new file
>> > > name). The clients with the i/o problem see inactive devices
>> > > corresponding to the read-only OSTs when I issue 'lfs df', while the
>> > > clients without the i/o problems report the targets as normal. Is it
>> > > just that many clients are not yet aware of the OST problem? I need
>> > > clients with minimal i/o disruptions in order to migrate as much data
>> > > off as possible.
>> > >
>> > > A client reboot appears to awaken them to the fact that there are
>> > > problems with the OSTs. However, I need them to be able to read the
>> > > data in order to migrate it off. Is there a way to reconnect the
>> > > clients to the problematic OSTs?
>> > >
>> > > We have dd-ed copies of the OSTs to try e2fsck against them, but the
>> > > results were not promising. The check aborted with:
>> > >
>> > > ------
>> > > Resize inode (re)creation failed: A block group is missing an inode
>> > > table. Continue? yes
>> > >
>> > > ext2fs_read_inode: A block group is missing an inode table while
>> > > reading inode 7 in recreate inode
>> > > e2fsck: aborted
>> > > ------
>> > >
>> > > Any advice would be greatly appreciated.
>> > > Zach
>> > > _______________________________________________
>> > > Lustre-discuss mailing list
>> > > Lustre-discuss at lists.lustre.org
>> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> >
>