[Lustre-discuss] IO-Node issue

DaMiri Young damiri at unt.edu
Wed Jul 20 08:37:46 PDT 2011


Hi Wojciech,
Stopping heartbeat sounds like a logical next step. Before I do that, 
though, I tried an fsck dry run using e2fsprogs v1.41.10 and got:
---------------------------------------------------------------
# e2fsck -n -v /dev/dm-11
e2fsck 1.41.10.sun2 (24-Feb-2010)
device /dev/dm-11 mounted by lustre per 
/proc/fs/lustre/obdfilter/es1-OST000a/mntdev
Warning!  /dev/dm-11 is mounted.
e2fsck: MMP: device currently active while trying to open /dev/dm-11

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
     e2fsck -b 32768 <device>
------------------------------------------------------------------

Do you suppose stopping heartbeat will allow the OST to be unmounted all 
the way by lustre? I tried unmounting manually and got:
------------------------------------------------------------------
# umount /dev/dm-11
umount: /dev/dm-11: not mounted
------------------------------------------------------------------
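
Since e2fsck reports the device as mounted by lustre while umount reports
it as not mounted, it may help to compare what the mount table and the
Lustre mntdev files each say. One possibility, which is only an assumption
here, is that the mount table lists the OST under a different device name.
A minimal sketch (mounts_for_dev and MOUNTS_FILE are hypothetical names;
the mntdev path is the one shown in the e2fsck output above):

```shell
#!/bin/sh
# Sketch only: show where a device appears in the mount table and in the
# Lustre mntdev files. MOUNTS_FILE and mounts_for_dev are hypothetical names.
DEV="${1:-/dev/dm-11}"
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"

# Print the mount point of every mount-table entry whose device field is $1.
mounts_for_dev() {
    awk -v dev="$1" '$1 == dev { print $2 }' "$2"
}

echo "mount table entries for $DEV:"
mounts_for_dev "$DEV" "$MOUNTS_FILE"

# Each obdfilter target records its backing device in a mntdev file.
for f in /proc/fs/lustre/obdfilter/*/mntdev; do
    [ -r "$f" ] || continue
    echo "$f: $(cat "$f")"
done
```

If the first list is empty while a mntdev file still names the device, the
mount table entry is probably recorded under another name or mount point.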


Wojciech Turek wrote:
> Hi Damiri,
> 
> If heartbeat is not able to start (mount) one of the OSTs, I would 
> recommend stopping heartbeat on both servers and then mounting the 
> troubled OST manually. Then you should see why the OST is not mounting. 
> To check the consistency of the filesystem, in your case I would first 
> run fsck with the -n switch to see the extent of the damage; this also 
> prevents further damage to your filesystem if you have a faulty 
> controller or links corrupting data. In a normal situation I use the 
> following command: fsck -f -v /dev/<ost_dev> -C0
> Make sure that you log the output from fsck, which will be essential 
> for further troubleshooting.
> 
> Best regards,
> 
> Wojciech
> 
> On 19 July 2011 16:58, Young, Damiri <Damiri.Young at unt.edu 
> <mailto:Damiri.Young at unt.edu>> wrote:
> 
>     Many thanks for the useful info Turek. I mentioned HA (heartbeat v2)
>     issues because after the troubled I/O node got its paths back to the
>     OSTs it failed 4 of the 5 OSTs over to its sibling server,
>     where they're now mounted. To me it seems the OSTs (we're using
>     lustre v1.6 btw) won't be released until the failed-over node is
>     reset by its sibling.
> 
>     The OSSs seem to have trouble connecting to the 1 OST I mentioned:
>     -------------------------------- messages
>     -------------------------------------
>     Jul 19 10:29:02 IO-10 kernel: LustreError: 29429: 


-- 
DaMiri Young
HPC System Engineer
High Performance Computing Team | ACUS/CITC | UNT


