[lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

Scott Wood woodystrash at hotmail.com
Wed Nov 29 15:09:23 PST 2017


Hi folks,


In an effort to replicate a production environment for a test upgrade, I've created a six-server KVM testbed on a CentOS 7.4 host with CentOS 6 guests: four OSSs and two MDSs.  The qcow2 virtual disks are visible to the servers in pairs: each OSS has two OSTs and can also mount its paired server's two OSTs, and the separate MGT and MDT volumes are likewise visible to, and mountable by, either MDS.  When I unmount an OST from one OSS and try to mount it on what will be its HA partner (I'm failing over manually until I get it working; then I'll install corosync and pacemaker), the second guest to mount the OST *occasionally* fails as follows:


[root@fakeoss4 ~]# mount /mnt/OST7
mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 to 2147483647
mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

And, from /var/log/messages:

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The configuration from log 'fake-OST0006' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-2)

The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp
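For reference, the manual failover I'm exercising boils down to two steps. Sketched here as a small dry-run helper so the order is explicit (the hostless function is illustrative only; device and mount point are the ones from the OST7 example above, and each printed command is run by hand on the node named in the comment):

```shell
# Dry-run sketch of a manual OST failover between an HA pair.
# Prints the two commands instead of executing them, since each
# must be run on a different node.
failover_ost() {
    dev=$1   # shared block device backing the OST
    mnt=$2   # mount point used on both OSSs
    # Step 1: on the OSS currently serving the target, unmount cleanly.
    echo "umount $mnt"
    # Step 2: on the HA partner, mount the shared device as a Lustre target.
    echo "mount -t lustre $dev $mnt"
}

# Example: the OST7 case from above.
failover_ost /dev/vde /mnt/OST7
```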


The environment was built as follows: a guest VM was installed from CentOS 6.5 install media.  The kernel was then updated to 2.6.32-504.8.1.el6_lustre.x86_64 from the Intel repos, and the Intel binary RPMs for Lustre were installed.  "exclude=kernel*" was added to the repo files under /etc/yum.repos.d/ and a "yum update" was run, so it's an up-to-date system apart from the locked-down kernel.  The installed e2fsprogs is e2fsprogs-1.42.12.wc1-7.el6.x86_64.  The VM was then cloned to make the six Lustre servers, and the filesystems were created with the following options:


[root@fakemds1 ~]# mkfs.lustre --fsname=fake --mgs --servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 /dev/vdb

[root@fakemds1 ~]# mkfs.lustre --reformat --fsname=fake --mdt --index=0 --servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc


[root@fakeoss1 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=0 --servicenode=192.168.122.197@tcp0 --servicenode=192.168.122.238@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb  # repeated for 3 more OSTs with the index and device changed appropriately


[root@fakeoss3 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=4 --servicenode=192.168.122.97@tcp0 --servicenode=192.168.122.221@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb  # repeated for 3 more OSTs with the index and device changed appropriately
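Given the "failed to copy remote log" errors, one sanity check before retrying the failover mount is to read back what mkfs.lustre actually wrote to the shared device: tunefs.lustre --dryrun prints the target's on-disk parameters (name, index, failover and MGS NIDs) without modifying anything. Sketched as a command builder so it stays runnable anywhere; on the OSSs, the printed command is run directly (the device name is just an example):

```shell
# Build the read-only verification command for a given target block device.
# tunefs.lustre --dryrun only displays the stored configuration (target
# name, index, failover.node and mgsnode parameters); it writes nothing.
dryrun_check() {
    printf 'tunefs.lustre --dryrun %s\n' "$1"
}

# Example: the command to run on either OSS sharing this device.
dryrun_check /dev/vdb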


Virtual disks were set as shareable and made visible to their correct VMs.  The OSTs often do mount on the failover node, but fail as above more than half the time.  Have I missed any important information that could point to the cause?


Once I get this VM environment stable, I intend to update it to Lustre 2.10.1.  Thanks in advance for any troubleshooting tips you can provide.


Cheers

Scott

