[lustre-discuss] MDT not mounting after tunefs.lustre changes on ZFS volumes
Bob Torgerson
rltorgerson at alaska.edu
Sun May 19 15:05:30 PDT 2019
Hello,
This is for a Lustre 2.10.3 file system with a single MDS and three OSSes.
The MDS has a separate MGT and MDT both mounted on it, and each OSS has 5
OSTs that do not fail over between hosts. We use ZFS as the backend for
each of the Lustre targets.
Here is the layout of the ZFS pool digdug-meta on our MDS server containing
both the MGT and MDT:
NAME                       USED  AVAIL  REFER  MOUNTPOINT
digdug-meta                268G   453G    96K  /digdug-meta
digdug-meta/lustre2-mdt0   266G   453G   266G  /digdug-meta/lustre2-mdt0
digdug-meta/mgs           4.10M   453G  4.10M  /digdug-meta/mgs
Yesterday, while attempting to add a new MDS server to act as a failover
node for the MGT and MDT, I unmounted the file system and stopped all of
the targets on the MDS (MGT and MDT) and the OSSes. The new MDS server is
192.168.2.13@o2ib1 and the current MDS server is 192.168.2.14@o2ib1. I
then ran the following commands on the MGT and MDT:
# tunefs.lustre --verbose --writeconf --erase-params \
    --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 \
    digdug-meta/mgs
# tunefs.lustre --verbose --writeconf --erase-params \
    --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 \
    --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 \
    digdug-meta/lustre2-mdt0
I ran a similar tunefs.lustre command on each of the OSTs as well, following this pattern:
# tunefs.lustre --verbose --writeconf --erase-params \
    --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 \
    --servicenode=<OSS NID> digdug-ost#/lustre2
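(For anyone reproducing this: the parameters actually written to each target can be checked afterward without modifying anything, using tunefs.lustre's --dryrun flag. A sketch with my pool and dataset names, which would obviously differ on another system:)

```shell
# Print the current on-disk parameters of a target without changing it.
tunefs.lustre --dryrun digdug-meta/mgs
tunefs.lustre --dryrun digdug-meta/lustre2-mdt0

# The same check works on each OSS for its OSTs, e.g.:
tunefs.lustre --dryrun digdug-ost1/lustre2
```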
After making those changes, I started the MGT and MDT on the original MDS,
which worked fine at first; I then started all of the OSTs and even mounted
a client. But when I tried to bring up the MGT and MDT on the new MDS node
192.168.2.13@o2ib1, it didn't work. I decided to bring the MGT and MDT back
up on the original MDS and figure it out later, but now I can't get the MDT
to mount on the original MDS either. I'm getting the following errors when
trying to mount the MDT after the MGT has been mounted:
May 19 13:53:09 mds02 systemd: Starting SYSV: Part of the lustre file
system....
May 19 13:53:09 mds02 lustre: Mounting digdug-meta/mgs on
/mnt/lustre/local/MGS
May 19 13:53:09 mds02 lustre: mount.lustre: according to /etc/mtab
digdug-meta/mgs is already mounted on /mnt/lustre/local/MGS
May 19 13:53:11 mds02 lustre: Mounting digdug-meta/lustre2-mdt0 on
/mnt/lustre/local/lustre2-MDT0000
May 19 13:53:11 mds02 kernel: Lustre: MGS: Logs for fs lustre2 were removed
by user request. All servers must be restarted in order to regenerate the
logs: rc = 0
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) lustre2-MDT0000-osd: bad
log lustre2-MDT0000 [0xa:0x7b:0x0] header magic: 0x0 (expected 0x10645539)
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) Skipped 1 previous similar
message
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC192.168.2.14@o2ib1:
failed to copy remote log lustre2-MDT0000: rc = -5
May 19 13:53:12 mds02 kernel: LustreError: 13a-8: Failed to get MGS log
lustre2-MDT0000 and no local copy.
May 19 13:53:12 mds02 kernel: LustreError: 15c-8: MGC192.168.2.14@o2ib1:
The configuration from log 'lustre2-MDT0000' failed (-2). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1373:server_start_targets()) failed to start
server lustre2-MDT0000: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start
targets: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1576:server_put_super()) no obd lustre2-MDT0000
May 19 13:53:12 mds02 kernel: Lustre: server umount lustre2-MDT0000 complete
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount (-2)
May 19 13:53:12 mds02 lustre: mount.lustre: mount digdug-meta/lustre2-mdt0
at /mnt/lustre/local/lustre2-MDT0000 failed: No such file or directory
May 19 13:53:12 mds02 lustre: Is the MGS specification correct?
May 19 13:53:12 mds02 lustre: Is the filesystem name correct?
May 19 13:53:12 mds02 lustre: If upgrading, is the copied client log valid?
(see upgrade docs)
May 19 13:53:13 mds02 systemd: lustre.service: control process exited,
code=exited status=2
May 19 13:53:13 mds02 systemd: Failed to start SYSV: Part of the lustre
file system..
May 19 13:53:13 mds02 systemd: Unit lustre.service entered failed state.
May 19 13:53:13 mds02 systemd: lustre.service failed.
This morning it was also discovered that the ZFS pool that contains the MGT
and MDT has a permanent error that may also be impacting our ability to
mount the MDT:
# zpool status -v digdug-meta
  pool: digdug-meta
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                        STATE   READ WRITE CKSUM
        digdug-meta                 ONLINE     0     0    70
          mirror-0                  ONLINE     0     0   141
            scsi-35000c5003017156b  ONLINE     0     0   141
            scsi-35000c500301715e7  ONLINE     0     0   141
            scsi-35000c5003017158b  ONLINE     0     0   141
            scsi-35000c500301716a3  ONLINE     0     0   141
          mirror-1                  ONLINE     0     0     1
            scsi-35000c5003017155f  ONLINE     0     0     1
            scsi-35000c500301715a7  ONLINE     0     0     1
            scsi-35000c5003017159b  ONLINE     0     0     1
            scsi-35000c5003017158f  ONLINE     0     0     1

errors: Permanent errors have been detected in the following files:

        digdug-meta/lustre2-mdt0:/oi.10/0xa:0x7b:0x0
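For safety, before attempting anything further I plan to snapshot the metadata datasets so that any recovery attempt stays reversible, roughly along these lines (pool and snapshot names are from my setup):

```shell
# Take recursive snapshots of the metadata pool before any repair attempt,
# so the current state can be rolled back to or zfs-sent elsewhere.
zfs snapshot -r digdug-meta@pre-recovery-20190519

# Confirm the snapshots exist.
zfs list -t snapshot -r digdug-meta

# A scrub re-reads all blocks and refreshes the pool's error list.
zpool scrub digdug-meta
```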
I'm not sure what my next steps would be to recover this file system, if
recovery is possible at all, and would greatly appreciate any help from
this group.
Thank you in advance,
Bob Torgerson