[Lustre-discuss] Lustre client question
Zachary Beebleson
zbeeble at math.uchicago.edu
Fri May 13 13:55:24 PDT 2011
Yes, the clients appear to have recovered. I didn't want to risk an fsck
until a new file-level backup was completed --- this will take time given
the size of our system.
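For what it's worth, a forced read-only pass should be safe to run even
before the backup finishes, since it answers "no" to every repair prompt and
never writes to the device. A minimal sketch, assuming the Lustre-patched
e2fsprogs and with /dev/sdX standing in for the actual OST device (or a dd
image of it):

------
# Read-only assessment of the unmounted OST before attempting any repair:
#   -f  force a full check even if the filesystem is marked clean
#   -n  open the filesystem read-only and answer "no" to every prompt
# /dev/sdX is a placeholder for the real OST device or a dd copy of it.
e2fsck -f -n /dev/sdX
------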
I've done at least 5 or 6 raid rebuilds in the past without issue using
these raid cards. We will try to isolate the cause of this problem
further --- perhaps a bad batch of spare drives, a buggy raid driver
(I think this is a newer Lustre version), etc.
Many thanks for your help.
Zach.
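PS. For the migration itself, the plan is essentially to script the
copy-to-a-new-name approach mentioned below, after deactivating the bad OST
on the MDS so that no new objects are allocated on it. A rough sketch ---
the OSC device name, OST UUID, and mount point are placeholders for ours:

------
# On the MDS: stop new object allocations on the failing OST.
# "lustre-OST0002-osc" is a placeholder; the real name/index is in 'lctl dl'.
lctl --device lustre-OST0002-osc deactivate

# On a client that can still read the OST: find every file with a stripe
# on it, copy it off, and rename over the original so the new copy lands
# on the healthy OSTs (newer Lustre ships an lfs_migrate script that
# automates roughly this).
lfs find --obd lustre-OST0002_UUID /mnt/lustre | while read -r f; do
    cp -p "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
------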
> It sounds like it is working better. Did the clients recover? I would have
> re-run fsck before mounting it again, and moving the data off may still be
> the best plan. Since dropping the rebuilt drive reduced the corruption, you
> should certainly contact your raid vendor about this issue.
>
> Kevin
>
>
> Zachary Beebleson wrote:
>> Kevin,
>>
>> I just failed the drive and remounted. A basic 'df' hangs when it gets to
>> the mount point, but /proc/fs/lustre/health_check reports everything is
>> healthy. 'lfs df' on a client reports the OST as active, where it was
>> inactive before. However, I'm now working with a degraded volume, though
>> it is raid 6. Should I try another rebuild or just proceed with the
>> migration off of this OST asap?
>>
>> Thanks,
>> Zach
>>
>> PS. Sorry for the repeat message
>> On Fri, 13 May 2011, Kevin Van Maren wrote:
>>
>> > See bug 24264 -- certainly possible that the raid controller corrupted
>> > your filesystem.
>> >
>> > If you remove the new drive and reboot, does the file system look
>> > cleaner?
>> >
>> > Kevin
>> >
>> >
>> > On May 13, 2011, at 11:39 AM, Zachary Beebleson
>> > <zbeeble at math.uchicago.edu> wrote:
>> >
>> > >
>> > > We recently had two raid rebuilds on a couple of storage targets that
>> > > did not go according to plan. The cards reported a successful rebuild
>> > > in each case, but ldiskfs errors started showing up on the associated
>> > > OSSs and the affected OSTs were remounted read-only. We are planning
>> > > to migrate the data off, but we've noticed that some clients are
>> > > getting i/o errors while others are not. As an example, a file with a
>> > > stripe on at least one affected OST could not be read on one client,
>> > > i.e. I received a read error trying to access it, while it was
>> > > perfectly readable and apparently uncorrupted on another (I am able
>> > > to migrate the file to healthy OSTs by copying it to a new file
>> > > name). The clients with the i/o problem see inactive devices
>> > > corresponding to the read-only OSTs when I issue 'lfs df', while the
>> > > clients without the i/o problems report the targets as normal. Is it
>> > > just that many clients are not yet aware of the OST problem? I need
>> > > clients with minimal i/o disruptions in order to migrate as much data
>> > > off as possible.
>> > >
>> > > A client reboot appears to awaken them to the fact that there are
>> > > problems with the OSTs. However, I need them to be able to read the
>> > > data in order to migrate it off. Is there a way to reconnect the
>> > > clients to the problematic OSTs?
>> > >
>> > > We have dd-ed copies of the OSTs to try e2fsck against them, but the
>> > > results were not promising. The check aborted with:
>> > >
>> > > ------
>> > > Resize inode (re)creation failed: A block group is missing an inode
>> > > table. Continue? yes
>> > >
>> > > ext2fs_read_inode: A block group is missing an inode table while
>> > > reading inode 7 in recreate inode
>> > > e2fsck: aborted
>> > > ------
>> > >
>> > > Any advice would be greatly appreciated.
>> > > Zach
>> > > _______________________________________________
>> > > Lustre-discuss mailing list
>> > > Lustre-discuss at lists.lustre.org
>> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> >
>