[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value

Thu Jul 15 08:14:07 PDT 2010


Wojciech Turek wrote:
> can you also please post the output of 'rpm -qa | grep lustre' run on
> puppy5-7?


[root at puppy5 log]# rpm -qa |grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root at puppy6 log]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root at puppy7 CONFIGS]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
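
All three hosts report the same five packages. For anyone repeating this
check on more nodes, here is a rough sketch that compares saved package
lists instead of eyeballing them. It assumes each host's
'rpm -qa | grep -i lustre' output has been saved to its own file; the file
names and the function name compare_rpm_lists are made up for illustration.

```shell
# Sketch only: compare package lists saved one file per host.
# Each file is assumed to hold the output of: rpm -qa | grep -i lustre
# The first argument is the reference list; the rest are compared to it.
compare_rpm_lists() {
    ref="$1"
    shift
    status=0
    for f in "$@"; do
        # Sort both lists so that ordering differences are ignored.
        if [ "$(sort "$ref")" = "$(sort "$f")" ]; then
            echo "$f: matches $ref"
        else
            echo "$f: DIFFERS from $ref"
            status=1
        fi
    done
    return "$status"
}
```

For example: compare_rpm_lists puppy5.rpms puppy6.rpms puppy7.rpms
The function returns non-zero if any list differs from the first one.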

Thanks,

Roger S.

> 
> On 15 July 2010 15:55, Roger Sersted <rs1 at aps.anl.gov> wrote:
> 
> 
>     OK.  This looks bad.  It appears that I should have upgraded ext3 to
>     ext4. I found instructions for that:
> 
>            tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>            fsck -pf /dev/XXX
>            
>     Is the above correct?  I'd like to move our systems to ext4. I
>     didn't know those steps were necessary.
> 
>     Other answers listed below.
> 
> 
>     Wojciech Turek wrote:
> 
>         Hi Roger,
> 
>         Sorry for the delay. From the ldiskfs messages it seems to me that
>         you are using ext4 ldiskfs
>         (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
>         ext4-2.6-rhel5).
>         If you are upgrading from 1.6.6, your ldiskfs is ext3-based, so I
>         think that in lustre-1.8.3 you should use the ext3-based ldiskfs rpm.
> 
>         Can you also tell us a bit more about your setup? From what you
>         wrote so far I understand you have 2 OSS servers and each server
>         has one OST device. In addition to that you have a third server
>         which acts as a MGS/MDS, is that right?
> 
>         The logs you provided seem to be only from one server called
>         puppy7 so it does not give a whole picture of the situation. The
>         timeout messages may indicate a problem with communication
>         between the servers but it is really difficult to say without
>         seeing the whole picture or at least more elements of it.
> 
>         To check if you have correct rpms installed can you please run
>         'rpm -qa | grep lustre' on both OSS servers and the MDS?
> 
>         Also, please provide the output of 'lctl list_nids' run on
>         both OSS servers, the MDS and a client.
> 
> 
>     puppy5 (MDS/MGS)
> 
>     172.17.2.5 at o2ib
>     172.16.2.5 at tcp
> 
>     puppy6 (OSS)
>     172.17.2.6 at o2ib
>     172.16.2.6 at tcp
> 
>     puppy7 (OSS)
>     172.17.2.7 at o2ib
>     172.16.2.7 at tcp
> 
> 
> 
> 
>         In addition to the above, please run the following command on all
>         Lustre targets (OSTs and MDT) to display your current Lustre
>         configuration:
> 
>          tunefs.lustre --dryrun --print /dev/<ost_device>
> 
> 
>     puppy5 (MDS/MGS)
>       Read previous values:
>     Target:     lustre1-MDT0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x405
>                  (MDT MGS )
>     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>     Parameters: lov.stripesize=125K lov.stripecount=2
>     mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>     mdt.group_upcall=NONE
> 
> 
>       Permanent disk data:
>     Target:     lustre1-MDT0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x405
>                  (MDT MGS )
>     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>     Parameters: lov.stripesize=125K lov.stripecount=2
>     mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>     mdt.group_upcall=NONE
> 
>     exiting before disk write.
>     ----------------------------------------------------
>     puppy6
>     checking for existing Lustre data: found CONFIGS/mountdata
>     Reading CONFIGS/mountdata
> 
>       Read previous values:
>     Target:     lustre1-OST0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x2
>                  (OST )
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5 at o2ib
> 
> 
>       Permanent disk data:
>     Target:     lustre1-OST0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x2
>                  (OST )
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5 at o2ib
>     --------------------------------------------------
>     puppy7 (this is the broken OSS. The "Target" should be
>     "lustre1-OST0001")
>     checking for existing Lustre data: found CONFIGS/mountdata
>     Reading CONFIGS/mountdata
> 
>       Read previous values:
>     Target:     lustre1-OST0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x2
>                  (OST )
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5 at o2ib
> 
> 
>       Permanent disk data:
>     Target:     lustre1-OST0000
>     Index:      0
>     Lustre FS:  lustre1
>     Mount type: ldiskfs
>     Flags:      0x2
>                  (OST )
>     Persistent mount opts: errors=remount-ro,extents,mballoc
>     Parameters: mgsnode=172.17.2.5 at o2ib
> 
>     exiting before disk write.
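
The duplicate "Target" above (both puppy6 and puppy7 report lustre1-OST0000)
can be spotted mechanically. A minimal sketch, assuming the output of
'tunefs.lustre --dryrun --print' for each target was saved to a file with no
spaces in its name. The file names and find_duplicate_targets are
hypothetical, and this only detects the clash; it does not repair anything.

```shell
# Sketch only: report any Lustre target name that appears in more than
# one saved 'tunefs.lustre --dryrun --print' output file.
find_duplicate_targets() {
    for f in "$@"; do
        # Take the first "Target:" line; the "Read previous values" and
        # "Permanent disk data" sections both carry one.
        t=$(awk '$1 == "Target:" { print $2; exit }' "$f")
        echo "$t $f"
    done | sort | awk '
        { count[$1]++; files[$1] = files[$1] " " $2 }
        END {
            for (t in count)
                if (count[t] > 1) {
                    print "duplicate " t ":" files[t]
                    dup = 1
                }
            exit dup + 0   # non-zero status when a duplicate was found
        }'
}
```

For example: find_duplicate_targets puppy5.txt puppy6.txt puppy7.txt
With the output quoted above, this would flag lustre1-OST0000 for the
puppy6 and puppy7 files.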
> 
> 
> 
>         If possible please attach syslog from each machine from the time
>         you mounted lustre targets (OST and MDT).
> 
>         Best regards,
> 
>         Wojciech
> 
>         On 14 July 2010 20:46, Roger Sersted <rs1 at aps.anl.gov> wrote:
> 
> 
>            Any additional info?
> 
>            Thanks,
> 
>            Roger S.
> 
> 
> 
> 
>         -- 
>         --
>         Wojciech Turek
> 
> 
> 
> 
> 
> -- 
> --
> Wojciech Turek
> 
> Assistant System Manager
> 
> High Performance Computing Service
> University of Cambridge
> Email: wjt27 at cam.ac.uk
> Tel: (+)44 1223 763517