[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Roger Sersted
rs1 at aps.anl.gov
Fri Jul 16 06:43:27 PDT 2010
I didn't find the hack anywhere. I looked at what those files contained and
decided to "hack and slash". Apparently, those files are generated from data
within the filesystem itself. A second run of writeconf showed the Target
value as "lustre1-OST0000", which is what I didn't want. :-(
Roger S.
Wojciech Turek wrote:
> Hi Roger
>
> Where did you find this CONFIGS hack?
> Did you make a copy of the CONFIGS dir before following these steps?
>
>
>
> On 15 July 2010 20:02, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
> I am using the ext4 RPMs. I ran the following commands on the MDS
> and OSS nodes (lustre was not running at the time):
>
>
> tune2fs -O extents,uninit_bg,dir_index /dev/XXX
> fsck -pf /dev/XXX
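For what it's worth, you can confirm afterwards that those feature flags actually took effect with dumpe2fs. The sketch below is mine, not from the thread: it exercises the same tune2fs invocation against a scratch image file rather than a real OST, so the image path and sizes are illustrative only.

```shell
# Illustrative only: run the ext3->ext4 feature upgrade against a scratch
# image file (not a real OST), then confirm the flags were set.
dd if=/dev/zero of=/tmp/ost-test.img bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/ost-test.img                    # plain ext2-style fs
tune2fs -O extents,uninit_bg,dir_index /tmp/ost-test.img
dumpe2fs -h /tmp/ost-test.img 2>/dev/null | grep -i 'filesystem features'
```

On a real device you would run only the dumpe2fs line, against /dev/XXX, and look for "extent", "uninit_bg" and "dir_index" in the feature list.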
>
> I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the
> OSSes and then the MDS. The problem still persisted. I then shut
> down Lustre by unmounting the Lustre filesystems on the MDS/OSS
> nodes.
>
> My last and most desperate step was to "hack" the CONFIG files. On
> puppy7, I did the following:
>
> 1. mount -t ldiskfs /dev/sdc /mnt
> 2. cd /mnt/CONFIGS
> 3. mv lustre1-OST0000 lustre1-OST0001
> 4. vim -nb lustre1-OST0001 mountdata
> 5. I changed OST0000 to OST0001.
> 6. I verified my changes by comparing an "od -c" of before and after.
> 7. umount /mnt
> 8. tunefs.lustre -writeconf /dev/sdc
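For anyone following along, the eight steps can be sketched as a small script. The device name and labels come from this thread; everything else (the backup path and the DRY_RUN guard) is my own addition, and with DRY_RUN=1, the default here, the script only prints what it would do rather than touching anything.

```shell
#!/bin/sh
# Sketch of the CONFIGS relabel attempt above. /dev/sdc and the target
# labels are taken from this thread; adjust for your own setup.
# With DRY_RUN=1 (the default) nothing is executed, only printed.
DEV=${DEV:-/dev/sdc}
OLD=lustre1-OST0000
NEW=lustre1-OST0001
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run mount -t ldiskfs "$DEV" /mnt
run cp -a /mnt/CONFIGS /root/CONFIGS.backup      # back up first
run mv /mnt/CONFIGS/"$OLD" /mnt/CONFIGS/"$NEW"
run vim -nb /mnt/CONFIGS/"$NEW" /mnt/CONFIGS/mountdata  # edit OST0000 -> OST0001
run cmp /root/CONFIGS.backup/mountdata /mnt/CONFIGS/mountdata  # expect a diff
run umount /mnt
run tunefs.lustre --writeconf "$DEV"
```

Note that writeconf regenerates the configuration logs from the on-disk mountdata, so (as reported at the top of this thread) the rename alone does not survive a second writeconf.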
>
> The output of step 8 is:
>
> tunefs.lustre -writeconf /dev/sdc
>
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre1-OST0001
>
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x102
> (OST writeconf )
>
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
> Permanent disk data:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x102
> (OST writeconf )
>
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
> Writing CONFIGS/mountdata
>
> Now part of the system seems to have the correct Target value.
>
> Thanks for your time on this.
>
> Roger S.
>
> Wojciech Turek wrote:
>
> Hi Roger,
>
> Lustre 1.8.3 for RHEL5 has two sets of RPMs: one for the old-style
> ext3-based ldiskfs and one for the ext4-based ldiskfs. When
> upgrading from 1.6.6 to 1.8.3 I think you should not use the
> ext4-based packages. Can you let us know which RPMs you used?
>
>
>
> On 15 July 2010 16:14, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
>
> Wojciech Turek wrote:
>
> Can you also please post the output of 'rpm -qa | grep lustre'
> run on puppy5-7?
>
>
>
> [root at puppy5 log]# rpm -qa |grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> [root at puppy6 log]# rpm -qa | grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> [root at puppy7 CONFIGS]# rpm -qa | grep -i lustre
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>
> Thanks,
>
> Roger S.
>
>
> On 15 July 2010 15:55, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
> OK. This looks bad. It appears that I should have upgraded
> ext3 to ext4. I found instructions for that:
>
> tune2fs -O extents,uninit_bg,dir_index /dev/XXX
> fsck -pf /dev/XXX
>
> Is the above correct? I'd like to move our systems to ext4. I
> didn't know those steps were necessary.
>
> Other answers listed below.
>
>
> Wojciech Turek wrote:
>
> Hi Roger,
>
> Sorry for the delay. From the ldiskfs messages it seems to me
> that you are using the ext4-based ldiskfs
> (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
> ext4-2.6-rhel5).
> If you are upgrading from 1.6.6, your ldiskfs is ext3 based, so
> I think that in lustre-1.8.3 you should use the ext3-based
> ldiskfs RPMs.
>
> Can you also tell us a bit more about your setup? From what you
> wrote so far I understand you have 2 OSS servers and each server
> has one OST device. In addition to that you have a third server
> which acts as the MGS/MDS, is that right?
>
> The logs you provided seem to be only from one server, called
> puppy7, so they do not give the whole picture of the situation.
> The timeout messages may indicate a problem with communication
> between the servers, but it is really difficult to say without
> seeing the whole picture, or at least more elements of it.
>
> To check whether you have the correct RPMs installed, can you
> please run 'rpm -qa | grep lustre' on both OSS servers and the
> MDS?
>
> Also, please provide the output of 'lctl list_nids' run on both
> OSS servers, the MDS, and a client.
>
>
> puppy5 (MDS/MGS)
>
> 172.17.2.5 at o2ib
> 172.16.2.5 at tcp
>
> puppy6 (OSS)
> 172.17.2.6 at o2ib
> 172.16.2.6 at tcp
>
> puppy7 (OSS)
> 172.17.2.7 at o2ib
> 172.16.2.7 at tcp
>
>
>
>
> In addition to the above, please run the following command on
> all Lustre targets (OSTs and MDT) to display your current Lustre
> configuration:
>
> tunefs.lustre --dryrun --print /dev/<ost_device>
>
>
> puppy5 (MDS/MGS)
> Read previous values:
> Target: lustre1-MDT0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x405
> (MDT MGS )
> Persistent mount opts:
> errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: lov.stripesize=125K lov.stripecount=2
> mdt.group_upcall=/usr/sbin/l_getgroups
> mdt.group_upcall=NONE
> mdt.group_upcall=NONE
>
>
> Permanent disk data:
> Target: lustre1-MDT0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x405
> (MDT MGS )
> Persistent mount opts:
> errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: lov.stripesize=125K lov.stripecount=2
> mdt.group_upcall=/usr/sbin/l_getgroups
> mdt.group_upcall=NONE
> mdt.group_upcall=NONE
>
> exiting before disk write.
> ----------------------------------------------------
> puppy6
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
> Permanent disk data:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
> --------------------------------------------------
> puppy7 (this is the broken OSS. The "Target" should be
> "lustre1-OST0001")
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
>
> Permanent disk data:
> Target: lustre1-OST0000
> Index: 0
> Lustre FS: lustre1
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro,extents,mballoc
> Parameters: mgsnode=172.17.2.5 at o2ib
>
> exiting before disk write.
>
>
>
> If possible, please attach the syslog from each machine from the
> time you mounted the Lustre targets (OST and MDT).
>
> Best regards,
>
> Wojciech
>
> On 14 July 2010 20:46, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
>
> Any additional info?
>
> Thanks,
>
> Roger S.
>
>
>
>
> --
> Wojciech Turek
>
>
>
>
>
> --
> Wojciech Turek
>
> Assistant System Manager
> 517
>
>
>
>