[Lustre-discuss] IO-Node issue

Wojciech Turek wjt27 at cam.ac.uk
Wed Jul 20 13:44:41 PDT 2011


Hi Damiri

I use more recent e2fsprogs:
e2fsprogs-1.41.12.2.ora1-0redhat.x86_64

I think you can get an even more recent version from Whamcloud.

As I recall, your version of e2fsck refuses to open a mounted filesystem even
in read-only mode, which is annoying and unnecessary. With a more recent
version, fsck -n should run even on a mounted filesystem.
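
If it helps, a dry run with the output kept in a log could be wrapped roughly
like this. It is only a sketch: the function name and the E2FSCK override are
made up for illustration, and it assumes a POSIX shell.

```shell
# Dry-run check of one device, keeping the full output in a log file.
# E2FSCK is a hypothetical override so the function can be tried with a stub;
# by default the real e2fsck is used.
fsck_dry_run() {
    dev=$1
    log=$2
    # -n opens the filesystem read-only and answers "no" to every question;
    # -v gives verbose output.
    "${E2FSCK:-e2fsck}" -n -v "$dev" > "$log" 2>&1
    rc=$?
    # Show the captured output and pass on e2fsck's own exit status.
    cat "$log"
    return "$rc"
}
```

For example: fsck_dry_run /dev/dm-11 /root/fsck-dm-11.log (the log path is
just an example). Note that e2fsck exits with 4 when it leaves errors
uncorrected, which is expected with -n on a damaged filesystem.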

Anyway, for a proper fsck you need to unmount the device that you want to check
and make sure that it is not mounted on the other node. Stopping heartbeat
should automatically unmount your OSTs if your Filesystem resources are
properly configured.
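
Before a real fsck, one way to double check that Lustre on a node no longer
holds the device is to look through the mntdev files that your e2fsck output
pointed at. A rough sketch, assuming each mntdev file contains the device path
as plain text (if it holds an alias such as a /dev/mapper name, this string
comparison will miss it); the function name is made up:

```shell
# Print the mntdev file that names DEV and return 0, or return 1 if none does.
# The base directory defaults to the obdfilter path seen in the e2fsck output;
# it is a parameter only so the function can be tried against a sample tree.
lustre_mntdev_owner() {
    dev=$1
    base=${2:-/proc/fs/lustre/obdfilter}
    for f in "$base"/*/mntdev; do
        # Skip the unexpanded pattern when no OST is set up on this node.
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "$dev" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}
```

Run it on both OSS nodes, e.g. lustre_mntdev_owner /dev/dm-11. It prints the
matching mntdev file and returns 0 if the device is still in use, and returns
1 otherwise.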

Best regards,

Wojciech

On 20 July 2011 16:37, DaMiri Young <damiri at unt.edu> wrote:

> Hi Wojciech,
> Stopping heartbeat sounds like a logical next step. Before I do that, though,
> I tried an fsck dry run using e2fsprogs v1.41.10 and got:
> ------------------------------------------------------------------
> # e2fsck -n -v /dev/dm-11
> e2fsck 1.41.10.sun2 (24-Feb-2010)
> device /dev/dm-11 mounted by lustre per /proc/fs/lustre/obdfilter/es1-OST000a/mntdev
> Warning!  /dev/dm-11 is mounted.
> e2fsck: MMP: device currently active while trying to open /dev/dm-11
>
> The superblock could not be read or does not describe a correct ext2
> filesystem.  If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>    e2fsck -b 32768 <device>
> ------------------------------------------------------------------
>
> Do you suppose stopping heartbeat will allow the OST to be unmounted all
> the way by lustre? I tried unmounting manually and got:
> ------------------------------------------------------------------
> # umount /dev/dm-11
> umount: /dev/dm-11: not mounted
> ------------------------------------------------------------------
>
>
> Wojciech Turek wrote:
>
>> Hi Damiri,
>>
>> If heartbeat is not able to start (mount) one of the OSTs, I would recommend
>> stopping heartbeat on both servers and then mounting the troubled OST
>> manually. You should then see why the OST does not mount. To check the
>> consistency of the filesystem, in your case I would first run fsck with the
>> -n switch to see the extent of the damage; this also avoids damaging your
>> filesystem even more if you have a faulty controller or links corrupting
>> data. In a normal situation I use the following command:
>> fsck -f -v /dev/<ost_dev> -C0
>> Make sure that you log the output from fsck, which will be essential for
>> further troubleshooting.
>>
>> Best regards,
>>
>> Wojciech
>>
>> On 19 July 2011 16:58, Young, Damiri <Damiri.Young at unt.edu> wrote:
>>
>>    Many thanks for the useful info Turek. I mentioned HA (heartbeat v2)
>>    issues because after the troubled I/O node got its paths back to the
>>    OSTs it failed 4 of the 5 OSTs over to its sibling server,
>>    where they're now mounted. To me it seems the OSTs (we're using
>>    lustre v1.6 btw) won't be released until the failed-over node is
>>    reset by its sibling.
>>
>>    The OSSs seem to have trouble connecting to the 1 OST I mentioned:
>>    -------------------------------- messages -------------------------------
>>    Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:
>>
>
>
> --
> DaMiri Young
> HPC System Engineer
> High Performance Computing Team | ACUS/CITC | UNT
>