[Lustre-discuss] Disappearing OSTs
jrs
botemout at gmail.com
Thu May 1 08:52:41 PDT 2008
Greetings,
I've posted before but no one responded. I'm reposting because I'm
really dead in the water here until I can get this fixed.
The issue is that my OSTs don't survive a reboot of the OSS.
In the examples below I'm dealing with two OSSs, quad-core Intel Xeon
machines, each with 8 GB of memory and a dual-port QLogic Fibre Channel
card. Both run SLES 10.1 and Lustre 1.6.4.3. My two MDSs (similar, though
not exactly the same hardware) don't have the same problem, though I'm only
accessing a single MDT from them.
I've reproduced the problem with something as simple as running:
umount /mnt/lustre/ost/ost_oss01_lustre0102_01
tune2fs -O +mmp /dev/mapper/ost_oss01_lustre0102_01
mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
When I look at the partition table with parted I see that it's changed
from loop to gpt (as shown below).
But the simplest case is:
oss01:/net/lmd01/space/lustre # mkfs.lustre --reformat --fsname i3_lfs3 --ost --failnode oss02 --mgsnode mds01 --mgsnode mds02 /dev/mapper/ost_oss01_lustre0102_01
oss01:/net/lmd01/space/lustre # reboot
# log in
oss01:/net/lmd01/space/lustre # mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01
mount.lustre: mount /dev/mapper/ost_oss01_lustre0102_01 at /mnt/lustre/ost/ost_oss01_lustre0102_01 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
oss01:/net/lmd01/space/lustre # dumpe2fs -h /dev/mapper/ost_oss01_lustre0102_01 |grep feature
dumpe2fs 1.40.4.cfs1 (31-Dec-2007)
dumpe2fs: Bad magic number in super-block while trying to open /dev/mapper/ost_oss01_lustre0102_01
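A minimal sketch of a check that does not depend on dumpe2fs, assuming the OST backing filesystem (ldiskfs) keeps the ext3 on-disk layout: the superblock starts at byte 1024 and its magic field sits 56 bytes into it, so bytes 1080-1081 should read 53 ef. The function name has_ext_magic is hypothetical, and the check only reads from the device.

```shell
#!/bin/sh
# has_ext_magic TARGET
# Succeeds if TARGET (a block device or image file) carries the ext2/ext3
# superblock magic 0xEF53. The magic is stored little-endian at byte
# offset 1080 (superblock at 1024 + field offset 56), i.e. bytes "53 ef".
# Hypothetical helper name; read-only.
has_ext_magic() {
    magic=$(dd if="$1" bs=1 skip=1080 count=2 2>/dev/null \
        | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "53ef" ]
}

# Example (run as root on the OSS, before and after the reboot):
#   has_ext_magic /dev/mapper/ost_oss01_lustre0102_01 && echo "magic present"
```

Running it right after mkfs.lustre and again after the reboot would show whether the first few kilobytes of the device are being overwritten in between.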
# another example, I re-run mkfs.lustre on the above device and mount
# it and 2 other OSTs on the second OSS
oss02:/net/lmd01/space/lustre # df|egrep 'File|ost'
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/ost_oss01_lustre0102_01
5768201600 469544 5474724244 1% /mnt/lustre/ost/ost_oss01_lustre0102_01
/dev/mapper/ost_oss01_lustre0102_02
5768201600 469540 5474724248 1% /mnt/lustre/ost/ost_oss01_lustre0102_02
/dev/mapper/ost_oss02_lustre0102_01
5768201600 479940 5474713848 1% /mnt/lustre/ost/ost_oss02_lustre0102_01
# I reboot the first machine then
oss02:/net/lmd01/space/lustre # umount -t lustre -a
# then try to mount from first machine and ...
oss01:/net/lmd01/space/lustre # cat a
mount -t lustre /dev/mapper/ost_oss01_lustre0102_01 /mnt/lustre/ost/ost_oss01_lustre0102_01
mount -t lustre /dev/mapper/ost_oss01_lustre0102_02 /mnt/lustre/ost/ost_oss01_lustre0102_02
mount -t lustre /dev/mapper/ost_oss02_lustre0102_01 /mnt/lustre/ost/ost_oss02_lustre0102_01
oss01:/net/lmd01/space/lustre # sh a
mount.lustre: mount /dev/mapper/ost_oss01_lustre0102_01 at /mnt/lustre/ost/ost_oss01_lustre0102_01 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
oss01:/net/lmd01/space/lustre # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/cciss/c0d0p3 61022084 6398044 51524300 12% /
udev 4089220 312 4088908 1% /dev
/dev/cciss/c0d0p1 1241220 48324 1129844 5% /boot
lmd01:/space 470387232 8296256 438196704 2% /net/lmd01/space
/dev/mapper/ost_oss01_lustre0102_02
5768201600 469540 5474724248 1% /mnt/lustre/ost/ost_oss01_lustre0102_02
/dev/mapper/ost_oss02_lustre0102_01
5768201600 479940 5474713848 1% /mnt/lustre/ost/ost_oss02_lustre0102_01
# So the device was up just fine on one machine, I umounted them and tried on the other OSS
# and the partition table has changed
oss01:/net/lmd01/space/lustre # /usr/local/sbin/parted /dev/mapper/ost_oss01_lustre0102_01
GNU Parted 1.8.8
Using /dev/mapper/ost_oss01_lustre0102_01
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: Linux device-mapper (dm)
Disk /dev/mapper/ost_oss01_lustre0102_01: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
(parted) quit
# I can't just put another partition table back
(parted) mklabel
Warning: The existing disk label on /dev/mapper/ost_oss01_lustre0102_01 will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? yes
New disk label type? [gpt]? loop
(parted) p
Model: Linux device-mapper (dm)
Disk /dev/mapper/ost_oss01_lustre0102_01: 6001GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Number Start End Size File system Flags
(parted) mkpart
File system type? [ext2]? ext3
Start? 0
End? 6001GB
(parted) p
Error: /dev/mapper/ost_oss01_lustre0102_01: unrecognised disk label
(parted) quit
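A minimal sketch for confirming what parted reports without opening parted, assuming 512-byte logical sectors as shown in the output above: a GPT header lives in the second sector and begins with the 8-byte signature "EFI PART". The function name has_gpt_signature is hypothetical, and the check only reads from the device.

```shell
#!/bin/sh
# has_gpt_signature TARGET
# Succeeds if TARGET has the GPT header signature "EFI PART" at the start
# of LBA 1. Assumes 512-byte logical sectors, so the signature is at byte
# offset 512. Compared as hex to avoid handling raw binary in the shell.
# Hypothetical helper name; read-only.
has_gpt_signature() {
    sig=$(dd if="$1" bs=1 skip=512 count=8 2>/dev/null \
        | od -An -tx1 | tr -d ' \n')
    # 45 46 49 20 50 41 52 54 is ASCII "EFI PART"
    [ "$sig" = "4546492050415254" ]
}

# Example (run as root on the OSS):
#   has_gpt_signature /dev/mapper/ost_oss01_lustre0102_01 && echo "GPT header found"
```

If the signature shows up on a device that was freshly formatted with mkfs.lustre, something wrote a GPT label to it after the format.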
# There is nothing unusual about the device; looking at multipath
oss01:/net/lmd01/space/lustre # multipath -l|grep ost_oss01_lustre0102_01
ost_oss01_lustre0102_01 (36000402001fc14596ef496ed00000000) dm-4 NEXSAN,SATABeast
oss02:/net/lmd01/space/lustre # multipath -l|grep ost_oss01_lustre0102_01
ost_oss01_lustre0102_01 (36000402001fc14596ef496ed00000000) dm-4 NEXSAN,SATABeast
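A minimal sketch for comparing the alias-to-WWID mapping across both OSSs for every OST rather than by eye, assuming `multipath -l` prints the alias in the first field and the WWID in parentheses in the second, as in the output above. The function name wwid_for_alias is hypothetical.

```shell
#!/bin/sh
# wwid_for_alias ALIAS
# Reads `multipath -l` style text on stdin and prints the WWID of the map
# whose first field equals ALIAS. Assumes the "alias (wwid) dm-N vendor"
# line layout shown in this post. Hypothetical helper name.
wwid_for_alias() {
    awk -v a="$1" '$1 == a { gsub(/[()]/, "", $2); print $2 }'
}

# Example: compare the mapping seen by each OSS.
#   ssh oss01 multipath -l | wwid_for_alias ost_oss01_lustre0102_01
#   ssh oss02 multipath -l | wwid_for_alias ost_oss01_lustre0102_01
```

The two commands should print the same WWID; a mismatch would mean the same alias points at different LUNs on the two servers.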
Any suggestions would be deeply appreciated.
Thanks much,
JR Smith