OK. So when you say "occasionally", does that mean that if you try the
command again, it works?

If so, I'm wondering if you are doing it before the timeout period has
expired, so Lustre is still expecting the OST to be on the original OSS.
That is, it is still in a window where "maybe it will come back".
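
If you want to rule that out, something like this may help (a rough
sketch from memory of Lustre 2.x; the host and target names are taken
from your message):

[root@fakeoss3 ~]# lctl dl    # on the OSS you unmounted from: the OST device should be gone from the list
[root@fakeoss4 ~]# lctl get_param timeout    # obd timeout in seconds; wait at least this long before retrying
[root@fakeoss4 ~]# lctl get_param obdfilter.fake-OST0006.recovery_status    # after a successful mount, watch recovery finish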

Brian Andrus
<div class="moz-cite-prefix">On 11/29/2017 3:09 PM, Scott Wood
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:SLXP216MB084604D4CF279D6E9A66712CAA3B0@SLXP216MB0846.KORP216.PROD.OUTLOOK.COM">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
<div id="divtagdefaultwrapper" dir="ltr" style="font-size: 12pt;
color: rgb(0, 0, 0); font-family: Calibri, Helvetica,
sans-serif, EmojiFont, "Apple Color Emoji",
"Segoe UI Emoji", NotoColorEmoji, "Segoe UI
Symbol", "Android Emoji", EmojiSymbols;">
<p style="margin-top:0; margin-bottom:0">Hi folks,</p>
<p style="margin-top:0; margin-bottom:0"><br>
</p>
<p style="margin-top:0; margin-bottom:0">In an effort to
replicate a production environment to do a test upgrade, I've
created a six server KVM testbed on a Centos 7.4 host with
CentOS 6 guests.
<span> I have four OSS and two MDSs. I have qcow2 virtual
disks visible to the servers in pairs. Each OSS has two
OSTs and can also mount its paired server's two OSTs. I
have separate MGT and MGT volumes, again, both visible and
mountable by either MDS. When I unmount an OST from one of
the OSSs and try to mount it on what will be its HA pair
(failing over manually now until I get it working, then I'll
install corosync and pacemaker), the second guest to mount
the OST *occasionally* fails as follows:</span></p>
<p style="margin-top:0; margin-bottom:0"><span><br>
</span></p>
<p style="margin-top:0; margin-bottom:0"><span></span></p>
<div>[root@fakeoss4 ~]# mount /mnt/OST7</div>
<div>mount.lustre: increased /sys/block/vde/queue/max_sectors_kb
from 1024 to 2147483647</div>
<div>mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such
file or directory</div>
<div>Is the MGS specification correct?</div>
<div>Is the filesystem name correct?</div>
<div>If upgrading, is the copied client log valid? (see upgrade
docs)</div>
<div><br>
</div>
And, from /var/log/messages:
<p style="margin-top:0; margin-bottom:0"><span></span></p>

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The configuration from log 'fake-OST0006' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-2)

The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp
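
For completeness, the target's own on-disk configuration can be checked
without changing anything (a sketch; /mnt/tmp is just a scratch
mountpoint, and the device must be unmounted from Lustre first):

[root@fakeoss4 ~]# tunefs.lustre --dry-run /dev/vde    # print the fsname, index and MGS/failover NIDs recorded at format time
[root@fakeoss4 ~]# mount -t ldiskfs -o ro /dev/vde /mnt/tmp    # read-only peek at the local copy of the config logs
[root@fakeoss4 ~]# ls -l /mnt/tmp/CONFIGS
[root@fakeoss4 ~]# umount /mnt/tmp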
<p style="margin-top:0; margin-bottom:0"><span><span>The
environment was built as follows: A guest VM was
installed from CentOS-6.5 install media. </span>The kernel
was then updated to <span>2.6.32-504.8.1.el6_lustre.x86_64
from the Intel repos,. The intel binary rpms for lustre
were then installed. "exclude=kernel*" was added to
/etc/yum.repos.d and a "yum update" was run, so its an up
to day system with the exception of the locked down
kernel. <span> e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the
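
(For reference, the pin is just yum's standard exclude directive; the
file name and line number below are illustrative:)

[root@fakeoss1 ~]# grep -n exclude /etc/yum.repos.d/*.repo
/etc/yum.repos.d/CentOS-Base.repo:5:exclude=kernel*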

e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the version installed. The VM
was then cloned to make the six Lustre servers, and the filesystems
were created with the following options:
<p style="margin-top:0; margin-bottom:0"><span><span>[root@fakemds1
~]# </span>mkfs.lustre --fsname=fake --mgs
--servicenode=192.168.122.5@tcp0
--servicenode=192.168.122.67@tcp0 /dev/vdb</span><br>
</p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>[root@fakemds1
~]# </span>mkfs.lustre --reformat --fsname=fake --mdt
--index=0 --servicenode=192.168.122.5@tcp0
--servicenode=192.168.122.67@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc</span><br>
</span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><br>
</span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span>[root@fakeoss1
~]# </span>mkfs.lustre --reformat --fsname=fake --ost
--index=0 --servicenode=192.168.122.197@tcp0
--servicenode=192.168.122.238@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0
/dev/vdb #repeated for 3 more OTSs with changed index
and devices appropriately</span><br>
</span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><br>
</span></span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><span>[root@fakeoss3
~]# </span>mkfs.lustre --reformat --fsname=fake
--ost --index=4 --servicenode=192.168.122.97@tcp0
--servicenode=192.168.122.221@tcp0
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0
/dev/vdb <span>#repeated for 3 more OTSs with changed
index and devices appropriately</span></span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span><br>
</span></span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><span>Virtual
disks were set as shareable and made visible to their
correct VMs and often do mount, but occasionally (more
than half the time) fail as above. Have I missed any
important information that could point to the cause?</span></span></span></span></p>
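
One thing I have not yet ruled out myself is host-side caching on the
shared qcow2 disks: as I understand it, a disk shared between two
guests should use cache='none' so that one guest's writes are not read
back stale from the host page cache by the other. A quick check on the
KVM host (the host name here is made up; 'vde' matches the device
above):

[root@kvmhost ~]# virsh dumpxml fakeoss4 | grep -B2 -A3 vde    # look for <shareable/> and cache='none' on the <driver> line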
<p style="margin-top:0; margin-bottom:0"><span><span><span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Once
I get this VM environment stable, I intend to update it
to lustre 2.10.1. Thanks in advance for any
troubleshooting tips you can provide.</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span><br>
</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Cheers</span></span></span></p>
<p style="margin-top:0; margin-bottom:0"><span><span><span>Scott</span></span></span></p>
<pre wrap="">_______________________________________________
lustre-discuss mailing list
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>
<a class="moz-txt-link-freetext" href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a>
</pre>