[Lustre-discuss] Disappearing OSTs

jrs botemout at gmail.com
Thu May 1 08:52:41 PDT 2008


Greetings,

I've posted before but no one responded. I'm reposting because I'm
really dead in the water here until I can get this fixed.

The issue is that my OSTs don't survive a reboot of the OSS.

In the below I'm dealing with two OSSs, quad-core Intel Xeon machines
with 8 GB of memory and a dual-port QLogic Fibre Channel card.  They both
run SLES 10.1 and Lustre 1.6.4.3.  My two MDSs (similar, though not
exactly the same hardware) don't have the same problem, though I'm only
accessing a single MDT from them.

I've produced the problem with something as simple as running:
umount /mnt/lustre/ost/ost_oss01_lustre0102_01
tune2fs -O +mmp /dev/mapper/ost_oss01_lustre0102_01
mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01

This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

When I look at the partition table with parted I see that it's changed
from loop to gpt (as shown below).

But the simplest case is:

oss01:/net/lmd01/space/lustre # mkfs.lustre --reformat --fsname i3_lfs3 --ost --failnode oss02 --mgsnode mds01 --mgsnode mds02 /dev/mapper/ost_oss01_lustre0102_01

oss01:/net/lmd01/space/lustre # reboot

# log in

oss01:/net/lmd01/space/lustre # mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01
mount.lustre: mount /dev/mapper/ost_oss01_lustre0102_01 at /mnt/lustre/ost/ost_oss01_lustre0102_01 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

oss01:/net/lmd01/space/lustre # dumpe2fs -h /dev/mapper/ost_oss01_lustre0102_01 |grep feature
dumpe2fs 1.40.4.cfs1 (31-Dec-2007)
dumpe2fs: Bad magic number in super-block while trying to open /dev/mapper/ost_oss01_lustre0102_01
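
For anyone who wants to cross-check the "Bad magic number" result without dumpe2fs, here is a minimal sketch. It assumes the OST is a plain ext3-style (ldiskfs) filesystem written directly to the device, so the two-byte superblock magic 0xEF53 should sit at byte offset 1080 (superblock at 1024, magic field 56 bytes in, stored little-endian). The function names ext_magic and has_ext_superblock are made up for illustration; the read is read-only.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Lustre or e2fsprogs.

# Print the two bytes at offset 1080 of a device or image as hex.
ext_magic() {
    dd if="$1" bs=1 skip=1080 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Succeed only if those bytes are the ext2/ext3 magic (0xEF53 on disk as 53 ef).
has_ext_superblock() {
    [ "$(ext_magic "$1")" = "53ef" ]
}

# Example use (needs read access to the device):
#   has_ext_superblock /dev/mapper/ost_oss01_lustre0102_01 \
#       && echo "superblock magic present" \
#       || echo "superblock magic missing"
```

If the magic is missing right after a reboot, that would suggest the start of the device was overwritten rather than merely misread, but that is an inference, not something this check proves.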

# another example, I re-run mkfs.lustre on the above device and mount
# it and 2 other OSTs on the second OSS

oss02:/net/lmd01/space/lustre # df|egrep 'File|ost'
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/ost_oss01_lustre0102_01
                      5768201600    469544 5474724244   1% /mnt/lustre/ost/ost_oss01_lustre0102_01
/dev/mapper/ost_oss01_lustre0102_02
                      5768201600    469540 5474724248   1% /mnt/lustre/ost/ost_oss01_lustre0102_02
/dev/mapper/ost_oss02_lustre0102_01
                      5768201600    479940 5474713848   1% /mnt/lustre/ost/ost_oss02_lustre0102_01

# I reboot the first machine then

oss02:/net/lmd01/space/lustre # umount -t lustre -a

# then try to mount from first machine and ...

oss01:/net/lmd01/space/lustre # cat a
   mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01
   mount -t lustre /dev/mapper/ost_oss01_lustre0102_02 /mnt/lustre/ost/ost_oss01_lustre0102_02
   mount -t lustre /dev/mapper/ost_oss02_lustre0102_01 /mnt/lustre/ost/ost_oss02_lustre0102_01
oss01:/net/lmd01/space/lustre # sh a
mount.lustre: mount /dev/mapper/ost_oss01_lustre0102_01 at /mnt/lustre/ost/ost_oss01_lustre0102_01 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
oss01:/net/lmd01/space/lustre # df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/cciss/c0d0p3     61022084   6398044  51524300  12% /
udev                   4089220       312   4088908   1% /dev
/dev/cciss/c0d0p1      1241220     48324   1129844   5% /boot
lmd01:/space         470387232   8296256 438196704   2% /net/lmd01/space
/dev/mapper/ost_oss01_lustre0102_02
                      5768201600    469540 5474724248   1% /mnt/lustre/ost/ost_oss01_lustre0102_02
/dev/mapper/ost_oss02_lustre0102_01
                      5768201600    479940 5474713848   1% /mnt/lustre/ost/ost_oss02_lustre0102_01

# So the devices were up just fine on one machine; I unmounted them and tried on the other OSS,
# and the partition table has changed

oss01:/net/lmd01/space/lustre # /usr/local/sbin/parted /dev/mapper/ost_oss01_lustre0102_01
GNU Parted 1.8.8
Using /dev/mapper/ost_oss01_lustre0102_01
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: Linux device-mapper (dm)
Disk /dev/mapper/ost_oss01_lustre0102_01: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start  End  Size  File system  Name  Flags

(parted) quit

# I can't just put another partition table back

(parted) mklabel
Warning: The existing disk label on /dev/mapper/ost_oss01_lustre0102_01 will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? yes
New disk label type?  [gpt]? loop
(parted) p
Model: Linux device-mapper (dm)
Disk /dev/mapper/ost_oss01_lustre0102_01: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End  Size  File system  Flags

(parted) mkpart
File system type?  [ext2]? ext3
Start? 0
End? 6001GB
(parted) p
Error: /dev/mapper/ost_oss01_lustre0102_01: unrecognised disk label
(parted) quit
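
To confirm independently of parted that a GPT label really has been written to the device, the raw bytes can be inspected. This is only a sketch: it assumes 512-byte logical sectors (as parted reports above), in which case the primary GPT header sits at LBA 1, byte offset 512, and begins with the 8-byte signature "EFI PART". The function name has_gpt_header is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a primary GPT header signature is present.
# Assumes 512-byte logical sectors, so LBA 1 starts at byte offset 512.
has_gpt_header() {
    # tr strips NUL bytes so the shell comparison stays clean on zeroed media.
    [ "$(dd if="$1" bs=1 skip=512 count=8 2>/dev/null | tr -d '\0')" = "EFI PART" ]
}

# Example use (read-only, needs read access to the device):
#   has_gpt_header /dev/mapper/ost_oss01_lustre0102_01 \
#       && echo "GPT signature found at LBA 1" \
#       || echo "no GPT signature at LBA 1"
```

Running it right after mkfs.lustre and again after the reboot would show whether the signature appears in between.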

# There is nothing unusual about the device; looking at multipath

oss01:/net/lmd01/space/lustre # multipath -l|grep ost_oss01_lustre0102_01
ost_oss01_lustre0102_01 (36000402001fc14596ef496ed00000000) dm-4 NEXSAN,SATABeast

oss02:/net/lmd01/space/lustre # multipath -l|grep ost_oss01_lustre0102_01
ost_oss01_lustre0102_01 (36000402001fc14596ef496ed00000000) dm-4 NEXSAN,SATABeast
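
The comparison above was done by eye. A small sketch of doing it mechanically follows, assuming `multipath -l` prints the alias first and the WWID in parentheses second, as in the output shown. The function name wwid_of is made up for illustration, and the ssh usage in the comment assumes password-less root ssh between the nodes.

```shell
#!/bin/sh
# Hypothetical helper: read `multipath -l` style output on stdin and print
# the WWID (the parenthesised second field) for the alias given as $1.
wwid_of() {
    awk -v name="$1" '$1 == name { gsub(/[()]/, "", $2); print $2 }'
}

# Example use, comparing both OSSs (ssh access is an assumption):
#   a=$(ssh oss01 multipath -l | wwid_of ost_oss01_lustre0102_01)
#   b=$(ssh oss02 multipath -l | wwid_of ost_oss01_lustre0102_01)
#   [ -n "$a" ] && [ "$a" = "$b" ] && echo "same LUN on both nodes"
```

Matching WWIDs only show that both nodes map the alias to the same LUN; they say nothing about what wrote to it.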


Any suggestions would be deeply appreciated.


Thanks much,
JR Smith





