[lustre-discuss] MDT not mounting after tunefs.lustre changes on ZFS volumes

Bob Torgerson rltorgerson at alaska.edu
Sun May 19 15:05:30 PDT 2019


Hello,

This is a Lustre 2.10.3 file system with a single MDS and three OSSes.
The MDS has a separate MGT and MDT mounted on it, and each OSS has 5
OSTs that do not fail over between hosts. All of the Lustre targets are
backed by ZFS.

Here is the layout of the ZFS pool digdug-meta on our MDS server containing
both the MGT and MDT:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
digdug-meta                268G   453G    96K  /digdug-meta
digdug-meta/lustre2-mdt0   266G   453G   266G  /digdug-meta/lustre2-mdt0
digdug-meta/mgs           4.10M   453G  4.10M  /digdug-meta/mgs

Yesterday, while attempting to add a new MDS server to act as a failover
node for the MGT and MDT, I stopped the file system and all of the
targets on the MDS (MGT and MDT) and the OSSes. The new MDS server is
192.168.2.13@o2ib1 and the current MDS server is 192.168.2.14@o2ib1.
I then ran the following commands on the MGT and MDT:

# tunefs.lustre --verbose --writeconf --erase-params \
    --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 \
    digdug-meta/mgs

# tunefs.lustre --verbose --writeconf --erase-params \
    --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 \
    --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 \
    digdug-meta/lustre2-mdt0

I also ran a tunefs.lustre command on each of the OSTs, following this pattern:

# tunefs.lustre --verbose --writeconf --erase-params \
    --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 \
    --servicenode=<OSS NID> digdug-ost#/lustre2
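Concretely, the per-OST loop on each OSS looked roughly like the dry run below. The OSS NID (192.168.2.21@o2ib1) and the digdug-ost$ost pool names are placeholders, not our exact ones; dropping the leading echo runs the real command:

```sh
# Dry run of the per-OST retune; "echo" just prints each command.
# The NID and pool names here are illustrative placeholders.
for ost in 1 2 3 4 5; do
  echo tunefs.lustre --verbose --writeconf --erase-params \
    --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 \
    --servicenode=192.168.2.21@o2ib1 "digdug-ost$ost/lustre2"
done
```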

After making those changes, I started the MGT and MDT on the original
MDS, which worked fine; I then started all of the OSTs and even mounted
a client. But when I tried to bring up the MGT and MDT on the new MDS
node, 192.168.2.13@o2ib1, it didn't work. I decided to bring the MGT and
MDT back up on the original MDS and figure it out later, but now I can't
get the MDT to mount on the original MDS either. I'm getting the
following errors when trying to mount the MDT after the MGT has been
mounted:

May 19 13:53:09 mds02 systemd: Starting SYSV: Part of the lustre file
system....
May 19 13:53:09 mds02 lustre: Mounting digdug-meta/mgs on
/mnt/lustre/local/MGS
May 19 13:53:09 mds02 lustre: mount.lustre: according to /etc/mtab
digdug-meta/mgs is already mounted on /mnt/lustre/local/MGS
May 19 13:53:11 mds02 lustre: Mounting digdug-meta/lustre2-mdt0 on
/mnt/lustre/local/lustre2-MDT0000
May 19 13:53:11 mds02 kernel: Lustre: MGS: Logs for fs lustre2 were removed
by user request.  All servers must be restarted in order to regenerate the
logs: rc = 0
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) lustre2-MDT0000-osd: bad
log lustre2-MDT0000 [0xa:0x7b:0x0] header magic: 0x0 (expected 0x10645539)
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) Skipped 1 previous similar
message
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC192.168.2.14@o2ib1:
failed to copy remote log lustre2-MDT0000: rc = -5
May 19 13:53:12 mds02 kernel: LustreError: 13a-8: Failed to get MGS log
lustre2-MDT0000 and no local copy.
May 19 13:53:12 mds02 kernel: LustreError: 15c-8: MGC192.168.2.14@o2ib1:
The configuration from log 'lustre2-MDT0000' failed (-2). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1373:server_start_targets()) failed to start
server lustre2-MDT0000: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start
targets: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1576:server_put_super()) no obd lustre2-MDT0000
May 19 13:53:12 mds02 kernel: Lustre: server umount lustre2-MDT0000 complete
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-2)
May 19 13:53:12 mds02 lustre: mount.lustre: mount digdug-meta/lustre2-mdt0
at /mnt/lustre/local/lustre2-MDT0000 failed: No such file or directory
May 19 13:53:12 mds02 lustre: Is the MGS specification correct?
May 19 13:53:12 mds02 lustre: Is the filesystem name correct?
May 19 13:53:12 mds02 lustre: If upgrading, is the copied client log valid?
(see upgrade docs)
May 19 13:53:13 mds02 systemd: lustre.service: control process exited,
code=exited status=2
May 19 13:53:13 mds02 systemd: Failed to start SYSV: Part of the lustre
file system..
May 19 13:53:13 mds02 systemd: Unit lustre.service entered failed state.
May 19 13:53:13 mds02 systemd: lustre.service failed.


This morning we also discovered that the ZFS pool containing the MGT
and MDT has a permanent error, which may be affecting our ability to
mount the MDT:

# zpool status -v digdug-meta
  pool: digdug-meta
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        digdug-meta                 ONLINE       0     0    70
          mirror-0                  ONLINE       0     0   141
            scsi-35000c5003017156b  ONLINE       0     0   141
            scsi-35000c500301715e7  ONLINE       0     0   141
            scsi-35000c5003017158b  ONLINE       0     0   141
            scsi-35000c500301716a3  ONLINE       0     0   141
          mirror-1                  ONLINE       0     0     1
            scsi-35000c5003017155f  ONLINE       0     0     1
            scsi-35000c500301715a7  ONLINE       0     0     1
            scsi-35000c5003017159b  ONLINE       0     0     1
            scsi-35000c5003017158f  ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        digdug-meta/lustre2-mdt0:/oi.10/0xa:0x7b:0x0


I'm not sure what my next steps should be to recover this file system,
if recovery is possible at all, and I would greatly appreciate any help
from this group.
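For what it's worth, the sequence I've been considering is sketched below. It is untested, and the second step is only my understanding of how --writeconf regenerates configuration logs; I'd welcome corrections:

```sh
# 1. Scrub the pool to get a complete picture of the damage; the
#    permanent error above is inside the MDT's object index (oi.10).
zpool scrub digdug-meta
zpool status -v digdug-meta

# 2. If only the copied configuration llog is damaged, another
#    --writeconf pass on every target (MGT, MDT, and all 15 OSTs),
#    then mounting in the order MGT -> MDT -> OSTs, should regenerate
#    the logs from scratch:
tunefs.lustre --writeconf digdug-meta/mgs
tunefs.lustre --writeconf digdug-meta/lustre2-mdt0
# ...plus the equivalent tunefs.lustre --writeconf on each OSS's OSTs.
```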

Thank you in advance,

Bob Torgerson
