[lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

Brian Andrus toomuchit@gmail.com
Wed Nov 29 15:51:58 PST 2017


Ok. So when you say 'occasionally', does that mean that if you try the
command again, it works?

If so, I'm wondering if you are attempting the remount before the timeout
period has expired, so Lustre is still expecting the OST to be on the
original OSS. That is, it is still in a window where "maybe it will come
back".
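
If that's what's going on, a quick check (just a sketch; 'timeout' here
is the obd_timeout tunable, and the mount point is from your mail) would
be to see how long that window is and wait it out before retrying:

# Show the timeout (in seconds) that governs how long peers wait
# before giving up on the original OSS:
lctl get_param timeout

# Retry the failover mount only once that window has passed:
sleep $(lctl get_param -n timeout) && mount /mnt/OST7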


Brian Andrus


On 11/29/2017 3:09 PM, Scott Wood wrote:
>
> Hi folks,
>
>
> In an effort to replicate a production environment for a test 
> upgrade, I've created a six-server KVM testbed on a CentOS 7.4 host 
> with CentOS 6 guests.  I have four OSSs and two MDSs.  I have qcow2 
> virtual disks visible to the servers in pairs.  Each OSS has two OSTs 
> and can also mount its paired server's two OSTs.  I have separate MGT 
> and MDT volumes, again both visible to and mountable by either MDS.  
> When I unmount an OST from one of the OSSs and try to mount it on what 
> will be its HA pair (failing over manually for now, as sketched below, 
> until I get it working; then I'll install corosync and pacemaker), the 
> second guest to mount the OST *occasionally* fails as follows:
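>
> For reference, each manual failover is roughly the following (a sketch; 
> hostnames are illustrative, and both nodes of a pair carry fstab entries 
> mapping the shared disk to the same mount point):
>
> [root@fakeoss3 ~]# umount /mnt/OST7    # release the target on its current OSS
> [root@fakeoss4 ~]# mount /mnt/OST7     # remount it on the HA partner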
>
>
> [root@fakeoss4 ~]# mount /mnt/OST7
> mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 to 2147483647
> mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or directory
> Is the MGS specification correct?
> Is the filesystem name correct?
> If upgrading, is the copied client log valid? (see upgrade docs)
>
> And, from /var/log/messages:
>
> Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with ordered data mode. quota=on. Opts:
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed to copy remote log fake-OST0006: rc = -5
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log fake-OST0006 and no local copy.
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The configuration from log 'fake-OST0006' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server fake-OST0006: -2
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
> Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
> Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
> Nov 29 10:55:34 fakeoss4 kernel: LustreError: 2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-2)
>
> The OSS that fails to mount can see the MGS in question:
>
> [root@fakeoss4 ~]# lctl ping 192.168.122.5
> 12345-0@lo
> 12345-192.168.122.5@tcp
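>
> In case it's useful, I can also look at the on-disk config logs the 
> errors complain about by mounting the target read-only as ldiskfs (a 
> sketch; /dev/vde is the device behind /mnt/OST7 on this node, and 
> /mnt/tmp is just a scratch mount point):
>
> [root@fakeoss4 ~]# mount -t ldiskfs -o ro /dev/vde /mnt/tmp
> [root@fakeoss4 ~]# ls -l /mnt/tmp/CONFIGS/    # the cached config llogs live here
> [root@fakeoss4 ~]# umount /mnt/tmp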
>
> The environment was built as follows: a guest VM was installed from 
> CentOS 6.5 install media. The kernel was then updated to 
> 2.6.32-504.8.1.el6_lustre.x86_64 from the Intel repos, and the Intel 
> binary RPMs for Lustre were then installed. "exclude=kernel*" was 
> added to the repo files in /etc/yum.repos.d and a "yum update" was 
> run, so it's an up-to-date system with the exception of the locked-down 
> kernel.  e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the version installed.  
> The VM was then cloned to make the six Lustre servers, and the 
> filesystems were created with the following options:
>
>
> [root@fakemds1 ~]# mkfs.lustre --fsname=fake --mgs --servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 /dev/vdb
>
> [root@fakemds1 ~]# mkfs.lustre --reformat --fsname=fake --mdt --index=0 --servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc
>
>
> [root@fakeoss1 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=0 --servicenode=192.168.122.197@tcp0 --servicenode=192.168.122.238@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb # repeated for 3 more OSTs with the index and device changed appropriately
>
>
> [root@fakeoss3 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=4 --servicenode=192.168.122.97@tcp0 --servicenode=192.168.122.221@tcp0 --mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb # repeated for 3 more OSTs with the index and device changed appropriately
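>
> For what it's worth, each target's on-disk configuration can be 
> double-checked with tunefs.lustre (a sketch, shown for one OST; the 
> failover NIDs show up under "Parameters:"):
>
> [root@fakeoss3 ~]# tunefs.lustre --print /dev/vdb   # shows fsname, index, service/mgs NIDs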
>
>
> Virtual disks were set as shareable and made visible to their correct 
> VMs, and the targets often do mount, but they fail as above frequently 
> (more than half the time).  Have I missed any important information 
> that could point to the cause?
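>
> One recovery step I've been holding in reserve (an assumption on my 
> part, following the manual's regenerate-configuration-logs procedure) 
> is a writeconf to rebuild the config logs, in case the local copies on 
> the targets really are corrupt:
>
> # With the whole filesystem stopped (all targets unmounted), on each server:
> [root@fakeoss4 ~]# tunefs.lustre --writeconf /dev/vde   # repeat for every MDT/OST
> # ...then remount in order: MGS first, then the MDT, then the OSTs.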
>
>
> Once I get this VM environment stable, I intend to update it to Lustre 
> 2.10.1.  Thanks in advance for any troubleshooting tips you can provide.
>
>
> Cheers
>
> Scott
>
>
>
