[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value

Wojciech Turek wjt27 at cam.ac.uk
Thu Jul 15 15:54:53 PDT 2010


Hi Roger

Where did you find this CONFIG hack?
Did you make a copy of the CONFIG dir before following these steps?
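
If not, it is worth making one before any further edits, so the original
configuration logs can be restored if the hand edit causes trouble. A
minimal sketch of what I mean (device and backup location assumed, adjust
to your setup):

        mount -t ldiskfs /dev/sdc /mnt
        cp -a /mnt/CONFIGS /root/CONFIGS.backup-$(date +%Y%m%d)
        umount /mnt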



On 15 July 2010 20:02, Roger Sersted <rs1 at aps.anl.gov> wrote:

>
> I am using the ext4 RPMs.  I ran the following commands on the MDS and OSS
> nodes (Lustre was not running at the time):
>
>
>        tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>        fsck -pf /dev/XXX
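>
> As a quick sanity check, the new features should then show up in the
> superblock feature list, e.g. (exact output wording assumed):
>
>        dumpe2fs -h /dev/XXX | grep -i features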
>
> I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the OSSes and
> then the MDS.  The problem still persisted.  I then shut down Lustre by
> unmounting the Lustre filesystems on the MDS/OSS nodes.
>
> My last and most desperate step was to "hack" the CONFIG files.  On puppy7,
> I did the following:
>
>        1. mount -t ldiskfs /dev/sdc /mnt
>        2. cd /mnt/CONFIGS
>        3. mv lustre1-OST0000 lustre1-OST0001
>        4. vim -nb lustre1-OST0001 mountdata
>        5. I changed OST0000 to OST0001.
>        6. I verified my changes by comparing an "od -c" of before and
> after.
>        7. umount /mnt
>        8. tunefs.lustre -writeconf /dev/sdc
>
> The output of step 8 is:
>
>  tunefs.lustre -writeconf /dev/sdc
>
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
>   Read previous values:
> Target:     lustre1-OST0001
>
> Index:      0
> Lustre FS:  lustre1
> Mount type: ldiskfs
> Flags:      0x102
>              (OST writeconf )
>
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
>   Permanent disk data:
> Target:     lustre1-OST0000
> Index:      0
> Lustre FS:  lustre1
> Mount type: ldiskfs
> Flags:      0x102
>              (OST writeconf )
>
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
> Writing CONFIGS/mountdata
>
> Now part of the system seems to have the correct Target value.
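>
> Once everything is mounted again I will check that the OST actually
> registers under the new name.  Something along these lines on puppy7
> should show it (assuming the usual lctl device listing):
>
>        lctl dl | grep -i obdfilter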
>
> Thanks for your time on this.
>
> Roger S.
>
> Wojciech Turek wrote:
>
>> Hi Roger,
>>
>> Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the old-style
>> ext3-based ldiskfs and one set for the ext4-based ldiskfs. When upgrading
>> from 1.6.6 to 1.8.3 I think you should not try to use the ext4-based
>> packages. Can you let us know which RPMs you have used?
>>
>>
>>
>> On 15 July 2010 16:14, Roger Sersted <rs1 at aps.anl.gov> wrote:
>>
>>
>>
>>    Wojciech Turek wrote:
>>
>>        Can you also please post the output of 'rpm -qa | grep lustre' run
>>        on puppy5-7?
>>
>>
>>
>>    [root at puppy5 log]# rpm -qa |grep -i lustre
>>    kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>    lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>    mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>
>>    [root at puppy6 log]# rpm -qa | grep -i lustre
>>    kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>    lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>    mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>
>>    [root at puppy7 CONFIGS]# rpm -qa | grep -i lustre
>>    kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>    lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>    mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>    lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>
>>    Thanks,
>>
>>    Roger S.
>>
>>
>>        On 15 July 2010 15:55, Roger Sersted <rs1 at aps.anl.gov> wrote:
>>
>>
>>           OK.  This looks bad.  It appears that I should have upgraded
>>           ext3 to ext4; I found instructions for that:
>>
>>                  tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>>                  fsck -pf /dev/XXX
>>
>>           Is the above correct?  I'd like to move our systems to ext4.
>>           I didn't know those steps were necessary.
>>
>>           Other answers listed below.
>>
>>
>>           Wojciech Turek wrote:
>>
>>               Hi Roger,
>>
>>               Sorry for the delay. From the ldiskfs messages it seems to
>>               me that you are using the ext4-based ldiskfs
>>               (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
>>               ext4-2.6-rhel5).
>>               If you are upgrading from 1.6.6, your ldiskfs is ext3
>>               based, so I think that in lustre-1.8.3 you should use the
>>               ext3-based ldiskfs RPMs.
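>>
>>               A quick way to confirm which flavour is actually loaded on
>>               each server is to look for that same line in the kernel
>>               log, e.g.:
>>
>>                dmesg | grep 'ldiskfs created from'
>>
>>               the ext4-based build reports ext4-2.6-rhel5 there, as in
>>               your log.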
>>
>>               Can you also tell us a bit more about your setup? From
>>               what you wrote so far I understand you have 2 OSS servers
>>               and each server has one OST device. In addition to that
>>               you have a third server which acts as an MGS/MDS, is that
>>               right?
>>
>>               The logs you provided seem to be only from one server,
>>               puppy7, so they do not give the whole picture of the
>>               situation. The timeout messages may indicate a problem
>>               with communication between the servers, but it is really
>>               difficult to say without seeing the whole picture, or at
>>               least more elements of it.
>>
>>               To check whether you have the correct RPMs installed, can
>>               you please run 'rpm -qa | grep lustre' on both OSS servers
>>               and the MDS?
>>
>>               Also, please provide the output of 'lctl list_nids' run on
>>               both OSS servers, the MDS and a client.
>>
>>
>>           puppy5 (MDS/MGS)
>>
>>           172.17.2.5 at o2ib
>>           172.16.2.5 at tcp
>>
>>           puppy6 (OSS)
>>           172.17.2.6 at o2ib
>>           172.16.2.6 at tcp
>>
>>           puppy7 (OSS)
>>           172.17.2.7 at o2ib
>>           172.16.2.7 at tcp
>>
>>
>>
>>
>>               In addition to the above, please run the following command
>>               on all Lustre targets (OSTs and MDT) to display your
>>               current Lustre configuration:
>>
>>                tunefs.lustre --dryrun --print /dev/<ost_device>
>>
>>
>>           puppy5 (MDS/MGS)
>>             Read previous values:
>>           Target:     lustre1-MDT0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x405
>>                        (MDT MGS )
>>           Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>           Parameters: lov.stripesize=125K lov.stripecount=2
>>           mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>>           mdt.group_upcall=NONE
>>
>>
>>             Permanent disk data:
>>           Target:     lustre1-MDT0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x405
>>                        (MDT MGS )
>>           Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>           Parameters: lov.stripesize=125K lov.stripecount=2
>>           mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>>           mdt.group_upcall=NONE
>>
>>           exiting before disk write.
>>           ----------------------------------------------------
>>           puppy6
>>           checking for existing Lustre data: found CONFIGS/mountdata
>>           Reading CONFIGS/mountdata
>>
>>             Read previous values:
>>           Target:     lustre1-OST0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x2
>>                        (OST )
>>           Persistent mount opts: errors=remount-ro,extents,mballoc
>>           Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>
>>             Permanent disk data:
>>           Target:     lustre1-OST0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x2
>>                        (OST )
>>           Persistent mount opts: errors=remount-ro,extents,mballoc
>>           Parameters: mgsnode=172.17.2.5 at o2ib
>>           --------------------------------------------------
>>           puppy7 (this is the broken OSS. The "Target" should be
>>           "lustre1-OST0001")
>>           checking for existing Lustre data: found CONFIGS/mountdata
>>           Reading CONFIGS/mountdata
>>
>>             Read previous values:
>>           Target:     lustre1-OST0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x2
>>                        (OST )
>>           Persistent mount opts: errors=remount-ro,extents,mballoc
>>           Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>
>>             Permanent disk data:
>>           Target:     lustre1-OST0000
>>           Index:      0
>>           Lustre FS:  lustre1
>>           Mount type: ldiskfs
>>           Flags:      0x2
>>                        (OST )
>>           Persistent mount opts: errors=remount-ro,extents,mballoc
>>           Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>           exiting before disk write.
>>
>>
>>
>>               If possible, please attach the syslog from each machine
>>               from the time you mounted the Lustre targets (OSTs and
>>               MDT).
>>
>>               Best regards,
>>
>>               Wojciech
>>
>>               On 14 July 2010 20:46, Roger Sersted <rs1 at aps.anl.gov> wrote:
>>
>>
>>                  Any additional info?
>>
>>                  Thanks,
>>
>>                  Roger S.
>>
>>
>>
>>
>>               --
>>               Wojciech Turek
>>
>>
>>
>>
>>
>>        --
>>        Wojciech Turek
>>
>>        Assistant System Manager
>>        517
>>
>>

