OK. So when you say "occasionally", does that mean that if you try the
command again, it works?

If so, I'm wondering if you are doing it before the timeout period has
expired, so Lustre is still expecting the OST to be on the original OSS.
That is, it is still in a window where "maybe it will come back".
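
If you want to rule that out, something like this may help (a rough
sketch from memory of Lustre 2.x; the host and target names are taken
from your message):

[root@fakeoss3 ~]# lctl dl    # on the OSS you unmounted from: the OST device should be gone from the list
[root@fakeoss4 ~]# lctl get_param timeout    # obd timeout in seconds; wait at least this long before retrying
[root@fakeoss4 ~]# lctl get_param obdfilter.fake-OST0006.recovery_status    # after a successful mount, watch recovery finish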

Brian Andrus
<div class="moz-cite-prefix">On 11/29/2017 3:09 PM, Scott Wood
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:SLXP216MB084604D4CF279D6E9A66712CAA3B0@SLXP216MB0846.KORP216.PROD.OUTLOOK.COM">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
<div id="divtagdefaultwrapper" dir="ltr" style="font-size: 12pt;
color: rgb(0, 0, 0); font-family: Calibri, Helvetica,
sans-serif, EmojiFont, "Apple Color Emoji",
"Segoe UI Emoji", NotoColorEmoji, "Segoe UI
Symbol", "Android Emoji", EmojiSymbols;">
<p style="margin-top:0; margin-bottom:0">Hi folks,</p>
<p style="margin-top:0; margin-bottom:0"><br>
</p>
<p style="margin-top:0; margin-bottom:0">In an effort to
replicate a production environment to do a test upgrade, I've
created a six server KVM testbed on a Centos 7.4 host with
CentOS 6 guests.
<span> I have four OSS and two MDSs. I have qcow2 virtual
disks visible to the servers in pairs. Each OSS has two
OSTs and can also mount its paired server's two OSTs. I
have separate MGT and MGT volumes, again, both visible and
mountable by either MDS. When I unmount an OST from one of
the OSSs and try to mount it on what will be its HA pair
(failing over manually now until I get it working, then I'll
install corosync and pacemaker), the second guest to mount
the OST *occasionally* fails as follows:</span></p>
<p style="margin-top:0; margin-bottom:0"><span><br>
</span></p>
<p style="margin-top:0; margin-bottom:0"><span></span></p>
<div>[root@fakeoss4 ~]# mount /mnt/OST7</div>
<div>mount.lustre: increased /sys/block/vde/queue/max_sectors_kb
from 1024 to 2147483647</div>
<div>mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such
file or directory</div>
<div>Is the MGS specification correct?</div>
<div>Is the filesystem name correct?</div>
<div>If upgrading, is the copied client log valid? (see upgrade
docs)</div>
<div><br>
</div>
And, from /var/log/messages:
<p style="margin-top:0; margin-bottom:0"><span></span></p>

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The configuration from log 'fake-OST0006' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-2)

The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp
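
For completeness, the target's own on-disk configuration can be checked
without changing anything (a sketch; /mnt/tmp is just a scratch
mountpoint, and the device must be unmounted from Lustre first):

[root@fakeoss4 ~]# tunefs.lustre --dry-run /dev/vde    # print the fsname, index and MGS/failover NIDs recorded at format time
[root@fakeoss4 ~]# mount -t ldiskfs -o ro /dev/vde /mnt/tmp    # read-only peek at the local copy of the config logs
[root@fakeoss4 ~]# ls -l /mnt/tmp/CONFIGS
[root@fakeoss4 ~]# umount /mnt/tmp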
<p style="margin-top:0; margin-bottom:0"><span><span>The
environment was built as follows: A guest VM was
installed from CentOS-6.5 install media. </span>The kernel
was then updated to <span>2.6.32-504.8.1.el6_lustre.x86_64
from the Intel repos,. The intel binary rpms for lustre
were then installed. "exclude=kernel*" was added to
/etc/yum.repos.d and a "yum update" was run, so its an up
to day system with the exception of the locked down
kernel. <span> e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the
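
(For reference, the pin is just yum's standard exclude directive; the
file name and line number below are illustrative:)

[root@fakeoss1 ~]# grep -n exclude /etc/yum.repos.d/*.repo
/etc/yum.repos.d/CentOS-Base.repo:5:exclude=kernel*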

e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the version installed. The VM
was then cloned to make the six Lustre servers, and the filesystems
were created with the following options:
<p style="margin-top:0; margin-bottom:0"><span><span>[root@fakemds1
~]# </span>mkfs.lustre --fsname=fake --mgs
--servicenode=192.168.122.5@tcp0
--servicenode=192.168.122.67@tcp0 /dev/vdb</span><br>
</p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>[root@fakemds1
~]# </span>mkfs.lustre --reformat --fsname=fake --mdt
--index=0 --servicenode=192.168.122.5@tcp0
--servicenode=192.168.122.67@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc</span><br>
</span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><br>
</span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span>[root@fakeoss1
~]# </span>mkfs.lustre --reformat --fsname=fake --ost
--index=0 --servicenode=192.168.122.197@tcp0
--servicenode=192.168.122.238@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0
/dev/vdb #repeated for 3 more OTSs with changed index
and devices appropriately</span><br>
</span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><br>
</span></span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><span>[root@fakeoss3
~]# </span>mkfs.lustre --reformat --fsname=fake
--ost --index=4 --servicenode=192.168.122.97@tcp0
--servicenode=192.168.122.221@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0
/dev/vdb <span>#repeated for 3 more OTSs with changed
index and devices appropriately</span></span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><br>
</span></span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span>Virtual
disks were set as shareable and made visible to their
correct VMs and often do mount, but occasionally (more
than half the time) fail as above. Have I missed any
important information that could point to the cause?</span></span></span></span></p>
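
One thing I have not yet ruled out myself is host-side caching on the
shared qcow2 disks: as I understand it, a disk shared between two
guests should use cache='none' so that one guest's writes are not read
back stale from the host page cache by the other. A quick check on the
KVM host (the host name here is made up; 'vde' matches the device
above):

[root@kvmhost ~]# virsh dumpxml fakeoss4 | grep -B2 -A3 vde    # look for <shareable/> and cache='none' on the <driver> line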
<p style="margin-top:0; margin-bottom:0"><span><span><span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Once
I get this VM environment stable, I intend to update it
to lustre 2.10.1. Thanks in advance for any
troubleshooting tips you can provide.</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Cheers</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Scott</span></span></span></p>
<pre wrap="">_______________________________________________
lustre-discuss mailing list
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>
<a class="moz-txt-link-freetext" href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a>
</pre>