<div dir="ltr"><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Hello,</div><div><br></div><div>This is for a Lustre 2.10.3 file system with a single MDS and three OSSes. The MDS has a separate MGT and MDT both mounted on it, and each OSS has 5 OSTs that do not fail over between the hosts. We use ZFS as the backend storage for all of the Lustre targets.</div><div><br></div><div>Here is the layout of the ZFS pool digdug-meta on our MDS server containing both the MGT and MDT:</div><br>NAME                       USED  AVAIL  REFER  MOUNTPOINT<br>digdug-meta                268G   453G    96K  /digdug-meta<br>digdug-meta/lustre2-mdt0   266G   453G   266G  /digdug-meta/lustre2-mdt0<br>digdug-meta/mgs           4.10M   453G  4.10M  /digdug-meta/mgs<div><br></div><div>Yesterday, while attempting to add a new MDS server to act as a failover node for the MGT and MDT, I stopped the entire file system, including all of the targets on the MDS (MGT and MDT) and the OSSes.
The new MDS server is 192.168.2.13@o2ib1 and the current MDS server is 192.168.2.14@o2ib1. I then ran the following commands on the MGT and MDT:</div><div><br></div><div># tunefs.lustre --verbose --writeconf --erase-params --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 digdug-meta/mgs<br><br># tunefs.lustre --verbose --writeconf --erase-params --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 digdug-meta/lustre2-mdt0<br><br>I also ran tunefs.lustre on each of the OSTs, following this pattern:</div><div><br></div><div># tunefs.lustre --verbose --writeconf --erase-params --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 --servicenode=&lt;OSS NID&gt; digdug-ost#/lustre2</div><div><br></div><div>After I made that change, I started the MGT and MDT on the original MDS, which initially worked fine; then I started all of the OSTs, and even mounted a client, but when I tried to bring up the MGT and MDT on the new MDS node 192.168.2.13@o2ib1, it didn't work. I decided to bring the MGT and MDT back up on the original MDS and figure it out later, but now I can't get the MDT to mount on the original MDS either. I'm getting the following set of errors when trying to mount the MDT after the MGT has been mounted:<br><br>May 19 13:53:09 mds02 systemd: Starting SYSV: Part of the lustre file system....<br>May 19 13:53:09 mds02 lustre: Mounting digdug-meta/mgs on /mnt/lustre/local/MGS<br>May 19 13:53:09 mds02 lustre: mount.lustre: according to /etc/mtab digdug-meta/mgs is already mounted on /mnt/lustre/local/MGS<br>May 19 13:53:11 mds02 lustre: Mounting digdug-meta/lustre2-mdt0 on /mnt/lustre/local/lustre2-MDT0000<br>May 19 13:53:11 mds02 kernel: Lustre: MGS: Logs for fs lustre2 were removed by user request.
All servers must be restarted in order to regenerate the logs: rc = 0<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(llog_osd.c:262:llog_osd_read_header()) lustre2-MDT0000-osd: bad log lustre2-MDT0000 [0xa:0x7b:0x0] header magic: 0x0 (expected 0x10645539)<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(llog_osd.c:262:llog_osd_read_header()) Skipped 1 previous similar message<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC192.168.2.14@o2ib1: failed to copy remote log lustre2-MDT0000: rc = -5<br>May 19 13:53:12 mds02 kernel: LustreError: 13a-8: Failed to get MGS log lustre2-MDT0000 and no local copy.<br>May 19 13:53:12 mds02 kernel: LustreError: 15c-8: MGC192.168.2.14@o2ib1: The configuration from log 'lustre2-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1373:server_start_targets()) failed to start server lustre2-MDT0000: -2<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -2<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1576:server_put_super()) no obd lustre2-MDT0000<br>May 19 13:53:12 mds02 kernel: Lustre: server umount lustre2-MDT0000 complete<br>May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-2)<br>May 19 13:53:12 mds02 lustre: mount.lustre: mount digdug-meta/lustre2-mdt0 at /mnt/lustre/local/lustre2-MDT0000 failed: No such file or directory<br>May 19 13:53:12 mds02 lustre: Is the MGS specification correct?<br>May 19 13:53:12 mds02 lustre: Is the filesystem name correct?<br>May 19 13:53:12 mds02 lustre: If upgrading, is the copied client log valid? 
(see upgrade docs)<br>May 19 13:53:13 mds02 systemd: lustre.service: control process exited, code=exited status=2<br>May 19 13:53:13 mds02 systemd: Failed to start SYSV: Part of the lustre file system..<br>May 19 13:53:13 mds02 systemd: Unit lustre.service entered failed state.<br>May 19 13:53:13 mds02 systemd: lustre.service failed.<br></div><div><br></div><div><br></div><div>This morning it was also discovered that the ZFS pool that contains the MGT and MDT has a permanent error that may also be impacting our ability to mount the MDT:<br><br>


# zpool status -v digdug-meta<br><br>  pool: digdug-meta<br><br> state: ONLINE<br><br>status: One or more devices has experienced an error resulting in data corruption.  Applications may be affected.<br><br>action: Restore the file in question if possible.  Otherwise restore the<br><br>entire pool from backup.<br><br>   see: <a href="http://zfsonlinux.org/msg/ZFS-8000-8A">http://zfsonlinux.org/msg/ZFS-8000-8A</a><br><br>  scan: none requested<br><br>config:<br><br><br>NAME                        STATE     READ WRITE CKSUM<br><br>digdug-meta                 ONLINE       0     0    70<br><br>  mirror-0                  ONLINE       0     0   141<br><br>    scsi-35000c5003017156b  ONLINE       0     0   141<br><br>    scsi-35000c500301715e7  ONLINE       0     0   141<br><br>    scsi-35000c5003017158b  ONLINE       0     0   141<br><br>    scsi-35000c500301716a3  ONLINE       0     0   141<br><br>  mirror-1                  ONLINE       0     0     1<br><br>    scsi-35000c5003017155f  ONLINE       0     0     1<br><br>    scsi-35000c500301715a7  ONLINE       0     0     1<br><br>    scsi-35000c5003017159b  ONLINE       0     0     1<br><br>    scsi-35000c5003017158f  ONLINE       0     0     1<br><br>errors: Permanent errors have been detected in the following files:<br><br>        digdug-meta/lustre2-mdt0:/oi.10/0xa:0x7b:0x0</div><div><br></div><div><br></div><div>I'm not sure what my next steps would be to recover this file system if at all possible, and would greatly appreciate any help from this group.</div><div><br></div><div>Thank you in advance,</div><div><br>Bob Torgerson</div></div></div></div>