[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Roger Sersted
rs1 at aps.anl.gov
Thu Jul 15 12:02:09 PDT 2010
I am using the ext4 RPMs. I ran the following commands on the MDS and OSS nodes
(Lustre was not running at the time):
tune2fs -O extents,uninit_bg,dir_index /dev/XXX
fsck -pf /dev/XXX
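A quick way to confirm the new features actually took effect (same placeholder
device path as above; I am assuming that checking the feature list is a
sufficient sanity check) would be:

tune2fs -l /dev/XXX | grep -i 'filesystem features'
# the "Filesystem features:" line should now list the flags enabled above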
I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the OSSes and then
the MDS. The problem still persisted. I then shut down Lustre by unmounting the
Lustre filesystems on the MDS/OSS nodes.
My last and most desperate step was to "hack" the CONFIG files. On puppy7, I
did the following:
1. mount -t ldiskfs /dev/sdc /mnt
2. cd /mnt/CONFIGS
3. mv lustre1-OST0000 lustre1-OST0001
4. vim -nb lustre1-OST0001 mountdata
5. I changed OST0000 to OST0001.
6. I verified my changes by comparing an "od -c" dump of each edited file before and after (sketched just after this list).
7. umount /mnt
8. tunefs.lustre -writeconf /dev/sdc
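For step 6, the comparison was roughly the following (the temporary file names
here are made up for illustration, and I did the same for the renamed config
file):

od -c mountdata > /tmp/mountdata.od.before
   ... edit with vim -nb as in step 4 ...
od -c mountdata > /tmp/mountdata.od.after
diff /tmp/mountdata.od.before /tmp/mountdata.od.after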
The output of step 8 is:
tunefs.lustre -writeconf /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre1-OST0001
Index: 0
Lustre FS: lustre1
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5 at o2ib
Permanent disk data:
Target: lustre1-OST0000
Index: 0
Lustre FS: lustre1
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5 at o2ib
Writing CONFIGS/mountdata
Now only part of the output shows the correct Target value: the "Read previous values" section reports lustre1-OST0001, but the "Permanent disk data" section still reports lustre1-OST0000 with Index 0.
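My suspicion is that tunefs.lustre rebuilds the "Permanent disk data" target
name from the filesystem name plus the numeric index stored in mountdata, and
that index is still 0, so editing the name string alone was not enough. If 1.8
allows changing the index on an already formatted target (I am not sure it
does), I assume something like the following would be the supported way to do
it; I would run it with --dryrun first and check the printed values:

tunefs.lustre --index=1 --writeconf --dryrun /dev/sdc
# unverified assumption: --index is honoured by tunefs.lustre on an existing OST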
Thanks for your time on this.
Roger S.
Wojciech Turek wrote:
> Hi Roger,
>
> Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the old-style
> ext3-based ldiskfs and one set for the ext4-based ldiskfs. When upgrading
> from 1.6.6 to 1.8.3, I think you should not try to use the ext4-based
> packages. Can you let us know which RPMs you have used?
>
>
>
> On 15 July 2010 16:14, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
>
> Wojciech Turek wrote:
>
> Can you also please post the output of 'rpm -qa | grep lustre' run on
> puppy5-7?
>
>
>
> [root at puppy5 log]# rpm -qa |grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> [root at puppy6 log]# rpm -qa | grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> [root at puppy7 CONFIGS]# rpm -qa | grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> Thanks,
>
> Roger S.
>
>
> On 15 July 2010 15:55, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
> OK. This looks bad. It appears that I should have upgraded ext3 to ext4;
> I found instructions for that:
>
> tune2fs -O extents,uninit_bg,dir_index /dev/XXX
> fsck -pf /dev/XXX
>
> Is the above correct? I'd like to move our systems to ext4. I didn't know
> those steps were necessary.
>
> Other answers listed below.
>
>
> Wojciech Turek wrote:
>
> Hi Roger,
>
> Sorry for the delay. From the ldiskfs messages it seems to me that you are
> using the ext4 ldiskfs (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
> ext4-2.6-rhel5). If you are upgrading from 1.6.6 your ldiskfs is ext3
> based, so I think that in lustre-1.8.3 you should use the ext3-based
> ldiskfs RPMs.
>
> Can you also tell us a bit more about your setup? From what you wrote so
> far I understand you have 2 OSS servers and each server has one OST
> device. In addition to that you have a third server which acts as the
> MGS/MDS, is that right?
>
> The logs you provided seem to be only from one server, called puppy7, so
> they do not give the whole picture of the situation. The timeout messages
> may indicate a problem with communication between the servers, but it is
> really difficult to say without seeing the whole picture, or at least more
> elements of it.
>
> To check whether you have the correct RPMs installed, can you please run
> 'rpm -qa | grep lustre' on both OSS servers and the MDS?
>
> Also, please provide the output of 'lctl list_nids' run on both OSS
> servers, the MDS, and a client.
>
>
> puppy5 (MDS/MGS)
>
> 172.17.2.5 at o2ib
> 172.16.2.5 at tcp
>
> puppy6 (OSS)
> 172.17.2.6 at o2ib
> 172.16.2.6 at tcp
>
> puppy7 (OSS)
> 172.17.2.7 at o2ib
> 172.16.2.7 at tcp
>
>
>
>
> In addition to the above, please run the following command on all Lustre
> targets (OSTs and MDT) to display your current Lustre configuration:
>
> tunefs.lustre --dryrun --print /dev/<ost_device>
>
>
> puppy5 (MDS/MGS)
> Read previous values:
> Target: lustre1-MDT0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x405
> (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: lov.stripesize=125K lov.stripecount=2
> mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
> mdt.group_upcall=NONE
>
>
> Permanent disk data:
> Target: lustre1-MDT0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x405
> (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: lov.stripesize=125K lov.stripecount=2
> mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
> mdt.group_upcall=NONE
>
> exiting before disk write.
> ----------------------------------------------------
> puppy6
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
> Permanent disk data:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
> --------------------------------------------------
> puppy7 (this is the broken OSS. The "Target" should be
> "lustre1-OST0001")
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
> Permanent disk data:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
> exiting before disk write.
>
>
>
> If possible, please attach the syslog from each machine from the time you
> mounted the Lustre targets (OST and MDT).
>
> Best regards,
>
> Wojciech
>
> On 14 July 2010 20:46, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
> Any additional info?
>
> Thanks,
>
> Roger S.
>
>
>
>
> --
> Wojciech Turek
>
>
>
>
>
> --
> Wojciech Turek
>
> Assistant System Manager
> 517
>