[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value

Roger Sersted rs1 at aps.anl.gov
Fri Jul 16 06:43:27 PDT 2010


I didn't find the hack anywhere.  I looked at what those files contained and 
decided to "hack and slash".  Apparently, those files are generated from data 
within the filesystem itself.  A second run of writeconf displayed the 
target value as "lustre1-OST0000", which is not what I wanted. :-(
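
The target name seems to be rebuilt from the filesystem name plus the index 
recorded in CONFIGS/mountdata, which would explain why renaming the files did 
not stick.  A read-only way to see what is actually stored on disk (assuming 
/dev/sdc is still the OST device in question):

        # print the stored target name and index without writing anything
        tunefs.lustre --dryrun --print /dev/sdc | grep -E 'Target|Index'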


Roger S.

Wojciech Turek wrote:
> Hi Roger
> 
> Where did you find this CONFIGS hack?
> Did you make a copy of the CONFIGS dir before following these steps?
> 
> 
> 
> On 15 July 2010 20:02, Roger Sersted <rs1 at aps.anl.gov> wrote:
> 
> 
>     I am using the ext4 RPMs.  I ran the following commands on the MDS
>     and OSS nodes (Lustre was not running at the time):
> 
> 
>            tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>            fsck -pf /dev/XXX
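> 
>     To confirm the new features took effect, tune2fs can list them
>     afterwards (same device placeholder as above):
> 
>            # the "Filesystem features" line should now include extents
>            tune2fs -l /dev/XXX | grep 'Filesystem features'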
> 
>     I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the
>     OSSes and then the MDS.  The problem persisted.  I then shut down
>     Lustre by unmounting the Lustre filesystems on the MDS/OSS nodes.
> 
>     My last and most desperate step was to "hack" the CONFIGS files.  On
>     puppy7, I did the following:
> 
>            1. mount -t ldiskfs /dev/sdc /mnt
>            2. cd /mnt/CONFIGS
>            3. mv lustre1-OST0000 lustre1-OST0001
>            4. vim -nb lustre1-OST0001 mountdata
>            5. I changed OST0000 to OST0001.
>            6. I verified my changes by comparing an "od -c" of before
>     and after.
>            7. umount /mnt
>            8. tunefs.lustre -writeconf /dev/sdc
> 
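>     For anyone repeating this, a safety copy of CONFIGS before step 3 is
>     cheap insurance; a minimal sketch, assuming the same device and mount
>     point as above:
> 
>            mount -t ldiskfs /dev/sdc /mnt      # expose the backing fs
>            tar czf /root/CONFIGS-backup.tar.gz -C /mnt CONFIGS
>            umount /mnt
> 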
>     The output of step 8 is:
> 
>      tunefs.lustre -writeconf /dev/sdc
> 
>     checking for existing Lustre data: found CONFIGS/mountdata
>     Reading CONFIGS/mountdata
> 
>       Read previous values:
>     Target:     lustre1-OST0001
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x102
>                  (OST writeconf )
> 
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5@o2ib
> 
> 
>       Permanent disk data:
>     Target:     lustre1-OST0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x102
>                  (OST writeconf )
> 
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5@o2ib
> 
>     Writing CONFIGS/mountdata
> 
>     Now part of the system seems to have the correct Target value.
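> 
>     If the stored index itself is what has to change, tunefs.lustre
>     documents an --index option; I am not certain 1.8.3 will accept it on
>     an already-formatted target, so treat this as an untested sketch:
> 
>            # attempt to set the target index to 1 and force a writeconf
>            tunefs.lustre --index=1 --writeconf /dev/sdc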
> 
>     Thanks for your time on this.
> 
>     Roger S.
> 
>     Wojciech Turek wrote:
> 
>         Hi Roger,
> 
>         Lustre 1.8.3 for RHEL5 has two sets of RPMs: one for the old-style
>         ext3-based ldiskfs and one for the ext4-based ldiskfs.  When
>         upgrading from 1.6.6 to 1.8.3 I think you should not use the
>         ext4-based packages.  Can you let us know which RPMs you used?
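> 
>         A quick way to tell which flavour is actually loaded is the
>         "ldiskfs created from" line in the kernel log, e.g.:
> 
>                # should mention ext4-2.6-rhel5 for the ext4 build
>                dmesg | grep 'ldiskfs created from'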
> 
> 
> 
>         On 15 July 2010 16:14, Roger Sersted <rs1 at aps.anl.gov> wrote:
> 
> 
> 
>            Wojciech Turek wrote:
> 
>                can you also please post the output of 'rpm -qa | grep lustre'
>                run on puppy5-7?
> 
> 
> 
>            [root@puppy5 log]# rpm -qa | grep -i lustre
>            kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>            lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>            mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> 
>            [root@puppy6 log]# rpm -qa | grep -i lustre
>            kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>            lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>            mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> 
>            [root@puppy7 CONFIGS]# rpm -qa | grep -i lustre
>            kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>            lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>            mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>            lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> 
>            Thanks,
> 
>            Roger S.
> 
> 
>                On 15 July 2010 15:55, Roger Sersted
>                <rs1 at aps.anl.gov> wrote:
> 
> 
>                   OK.  This looks bad.  It appears that I should have
>                   upgraded ext3 to ext4.  I found instructions for that:
> 
>                          tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>                          fsck -pf /dev/XXX
> 
>                   Is the above correct?  I'd like to move our systems to
>                   ext4.  I didn't know those steps were necessary.
> 
>                   Other answers listed below.
> 
> 
>                   Wojciech Turek wrote:
> 
>                       Hi Roger,
> 
>                       Sorry for the delay.  From the ldiskfs messages it
>                       seems to me that you are using the ext4-based ldiskfs
>                       (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
>                       ext4-2.6-rhel5).
>                       If you are upgrading from 1.6.6, your ldiskfs is
>                       ext3-based, so I think that in lustre-1.8.3 you
>                       should use the ext3-based ldiskfs RPM.
> 
>                       Can you also tell us a bit more about your setup?
>                       From what you wrote so far I understand you have 2
>                       OSS servers, each with one OST device.  In addition
>                       you have a third server which acts as the MGS/MDS,
>                       is that right?
> 
>                       The logs you provided seem to be only from one
>                       server, puppy7, so they do not give the whole picture
>                       of the situation.  The timeout messages may indicate
>                       a problem with communication between the servers, but
>                       it is really difficult to say without seeing the
>                       whole picture, or at least more elements of it.
> 
>                       To check that you have the correct RPMs installed,
>                       can you please run 'rpm -qa | grep lustre' on both
>                       OSS servers and the MDS?
> 
>                       Also, please provide the output of 'lctl list_nids'
>                       run on both OSS servers, the MDS, and a client.
> 
> 
>                   puppy5 (MDS/MGS)
> 
>                   172.17.2.5@o2ib
>                   172.16.2.5@tcp
> 
>                   puppy6 (OSS)
>                   172.17.2.6@o2ib
>                   172.16.2.6@tcp
> 
>                   puppy7 (OSS)
>                   172.17.2.7@o2ib
>                   172.16.2.7@tcp
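> 
>                   Given the timeout messages, a quick connectivity check
>                   between those NIDs might also help; lctl ping takes a
>                   NID, e.g. from an OSS:
> 
>                          # verify LNET reachability to the MDS over o2ib
>                          lctl ping 172.17.2.5@o2ib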
> 
> 
> 
> 
>                       In addition to the above, please run the following
>                       command on all Lustre targets (OSTs and MDT) to
>                       display your current Lustre configuration:
> 
>                        tunefs.lustre --dryrun --print /dev/<ost_device>
> 
> 
>                   puppy5 (MDS/MGS)
>                     Read previous values:
>                   Target:     lustre1-MDT0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x405
>                                (MDT MGS )
>                   Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>                   Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE
> 
> 
>                     Permanent disk data:
>                   Target:     lustre1-MDT0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x405
>                                (MDT MGS )
>                   Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>                   Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE
> 
>                   exiting before disk write.
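> 
>                   (As an aside, the repeated mdt.group_upcall entries above
>                   suggest the parameter was set more than once.  If they
>                   ever need cleaning up, tunefs.lustre has an --erase-params
>                   option, after which any wanted parameters must be set
>                   again; a sketch, with the MDT device name assumed:
> 
>                          # wipe stored params, then re-set the one we want
>                          tunefs.lustre --erase-params --param mdt.group_upcall=NONE /dev/<mdt_device>
>                   )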
>                   ----------------------------------------------------
>                   puppy6
>                   checking for existing Lustre data: found CONFIGS/mountdata
>                   Reading CONFIGS/mountdata
> 
>                     Read previous values:
>                   Target:     lustre1-OST0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x2
>                                (OST )
>                   Persistent mount opts: errors=remount-ro,extents,mballoc
>                   Parameters: mgsnode=172.17.2.5@o2ib
> 
> 
>                     Permanent disk data:
>                   Target:     lustre1-OST0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x2
>                                (OST )
>                   Persistent mount opts: errors=remount-ro,extents,mballoc
>                   Parameters: mgsnode=172.17.2.5@o2ib
>                   --------------------------------------------------
>                   puppy7 (this is the broken OSS. The "Target" should be
>                   "lustre1-OST0001")
>                   checking for existing Lustre data: found CONFIGS/mountdata
>                   Reading CONFIGS/mountdata
> 
>                     Read previous values:
>                   Target:     lustre1-OST0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x2
>                                (OST )
>                   Persistent mount opts: errors=remount-ro,extents,mballoc
>                   Parameters: mgsnode=172.17.2.5@o2ib
> 
> 
>                     Permanent disk data:
>                   Target:     lustre1-OST0000
>                   Index:      0
>                   Lustre FS:  lustre1
>                   Mount type: ldiskfs
>                   Flags:      0x2
>                                (OST )
>                   Persistent mount opts: errors=remount-ro,extents,mballoc
>                   Parameters: mgsnode=172.17.2.5@o2ib
> 
>                   exiting before disk write.
> 
> 
> 
>                       If possible, please attach the syslog from each
>                       machine from the time you mounted the Lustre targets
>                       (OST and MDT).
> 
>                       Best regards,
> 
>                       Wojciech
> 
>                       On 14 July 2010 20:46, Roger Sersted
>                       <rs1 at aps.anl.gov> wrote:
> 
> 
>                          Any additional info?
> 
>                          Thanks,
> 
>                          Roger S.
> 
> 
> 
> 
>                       -- 
>                       Wojciech Turek
> 
> 
> 
> 
> 
>                -- 
>                Wojciech Turek
> 
>                Assistant System Manager
>                517
> 
> 
> 
> 