[Lustre-discuss] Lustre client question

Kevin Van Maren kevin.van.maren at oracle.com
Fri May 13 13:29:06 PDT 2011


It sounds like it is working better.  Did the clients recover?  I would 
have re-run fsck before mounting it again, and moving the data off may 
still be the best plan.  Since dropping the rebuilt drive reduced the 
corruption, you should certainly contact your RAID vendor about this issue.
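A read-only pass is usually enough to gauge the damage before deciding
anything; a minimal sketch, assuming the OST block device is /dev/sdb (a
placeholder) and the target is unmounted:

------
# Forced, read-only check: -n answers "no" to every prompt, so nothing on disk is modified
e2fsck -fn /dev/sdb
------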

Kevin


Zachary Beebleson wrote:
> Kevin,
>
> I just failed the drive and remounted. A basic 'df' hangs when it gets to
> the mount point, but /proc/fs/lustre/health_check reports everything is
> healthy. 'lfs df' on a client reports the OST as active, whereas it was
> inactive before. However, I'm now working with a degraded volume, though
> it is RAID 6. Should I try another rebuild or just proceed with the
> migration off of this OST asap?
>
> Thanks,
> Zach
>
> PS. Sorry for the repeat message
> On Fri, 13 May 2011, Kevin Van Maren wrote:
>
>> See bug 24264 -- it is certainly possible that the RAID controller 
>> corrupted your filesystem.
>>
>> If you remove the new drive and reboot, does the file system look 
>> cleaner?
>>
>> Kevin
>>
>>
>> On May 13, 2011, at 11:39 AM, Zachary Beebleson 
>> <zbeeble at math.uchicago.edu> wrote:
>>
>>>
>>> We recently had two RAID rebuilds on a couple of storage targets that
>>> did not go according to plan. The cards reported a successful rebuild
>>> in each case, but ldiskfs errors started showing up on the associated
>>> OSSs and the affected OSTs were remounted read-only. We are planning
>>> to migrate the data off, but we've noticed that some clients are
>>> getting I/O errors while others are not. As an example, a file with a
>>> stripe on at least one affected OST could not be read on one client,
>>> i.e. I received a read error trying to access it, while it was
>>> perfectly readable and apparently uncorrupted on another (I am able
>>> to migrate the file to healthy OSTs by copying it to a new file name).
>>> The clients with the I/O problem see inactive devices corresponding
>>> to the read-only OSTs when I issue an 'lfs df', while the others
>>> without the I/O problems report the targets as normal. Is it just
>>> that many clients are not yet aware of an OST problem? I need clients
>>> with minimal I/O disruption in order to migrate as much data off as
>>> possible.
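A sketch of one way to locate and copy off the files touching a suspect
OST, with placeholder names (lustre-OST0003_UUID and /mnt/lustre are
hypothetical); deactivating the target on the MDS first keeps new objects
from landing on it:

------
# Collect the files that have a stripe on the suspect OST
lfs find /mnt/lustre --obd lustre-OST0003_UUID > /tmp/ost0003.files

# Re-create each file under a new name (allocated on healthy OSTs), then rename it back
while read -r f; do
    cp -p "$f" "$f.migrate" && mv "$f.migrate" "$f"
done < /tmp/ost0003.files
------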
>>>
>>> A client reboot appears to awaken them to the fact that there are 
>>> problems with
>>> the OSTs. However, I need them to be able to read the data in order 
>>> to migrate
>>> it off. Is there a way to reconnect the clients to the problematic 
>>> OSTs?
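Short of a reboot, one thing that sometimes works is reactivating the
client's OSC for that target with lctl; a sketch, where the device
number 11 is purely illustrative:

------
# On the client: find the OSC device corresponding to the affected OST
lctl dl | grep osc

# Reactivate it by device number (11 is only an example)
lctl --device 11 activate
------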
>>>
>>> We have dd-ed copies of the OSTs to try e2fsck against them, but the 
>>> results
>>> were not promising. The check aborted with:
>>>
>>> ------
>>> Resize inode (re)creation failed: A block group is missing an inode
>>> table. Continue? yes
>>>
>>> ext2fs_read_inode: A block group is missing an inode table while
>>> reading inode 7 in recreate inode
>>> e2fsck: aborted
>>> ------
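When e2fsck dies that early, pointing it (ideally the Lustre-patched
e2fsprogs) at a backup superblock on the dd image sometimes lets the
check proceed; a sketch assuming a 4 KB-block filesystem, with the loop
device and image path as placeholders:

------
# Attach the dd image to a loop device
losetup /dev/loop0 /scratch/ost0003.img

# Read-only pass using the first backup superblock (32768 for 4 KB blocks)
e2fsck -fn -b 32768 -B 4096 /dev/loop0
------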
>>>
>>> Any advice would be greatly appreciated.
>>> Zach
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>



