<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    That is an interesting mix.  lctl show_route shows nothing at all on
    the clients, even on the 3 that route to a second NIC.  On the OSSes,
    the 3 routers show quite a mix of up and down, with no obvious
    pattern.<br>

    <br>

    Most of our traffic is on the 10.10 network, with the 3 machines

    shown below routing to a small number of clients on a more public

    network.<br>

    <br>

    FYI, the current situation is one in which all machines are happy,

    as far as I can tell.<br>

    <br>

    bob<br>

    <br>

    Running lctl show_route on all machines in lustre_fss.txt<br>

    On umdist05.local<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

             Succeeded<br>

    On umfs06.local<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp up<br>

             Succeeded<br>

    On umdist01.local<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

             Succeeded<br>

    On umdist02.local<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

             Succeeded<br>

    On umdist03.local<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp up<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

             Succeeded<br>

    On umdist04.local<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp up<br>

             Succeeded<br>

    On umdist07.local<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

             Succeeded<br>

    On umdist08.local<br>

    net               tcp2 hops 1 gw                   10.10.1.50@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.52@tcp

    down<br>

    net               tcp2 hops 1 gw                   10.10.1.51@tcp

    down<br>

             Succeeded<br>

    <br>
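    To pick the downed routes out of output like the above, something
    along these lines might help.  It is only a minimal sketch: it
    assumes the line layout shown in this message (other Lustre versions
    may print routes differently), and down_routes is a hypothetical
    helper name, not part of Lustre.<br>
    <pre>
```python
import re

# Pattern for one route line as it appears in the output above, e.g.
#   net  tcp2 hops 1 gw  10.10.1.52@tcp down
# Assumption: fields always appear in this order; \s+ also absorbs the
# line wrap that sometimes pushes "down" onto the next line.
ROUTE_RE = re.compile(r"net\s+(\S+)\s+hops\s+(\d+)\s+gw\s+(\S+)\s+(up|down)")


def down_routes(text):
    """Return (net, gateway) pairs for every route reported as down."""
    return [
        (net, gw)
        for net, _hops, gw, state in ROUTE_RE.findall(text)
        if state == "down"
    ]


# Sample copied from the umfs06.local output above.
sample = """
net               tcp2 hops 1 gw                   10.10.1.51@tcp
down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
net               tcp2 hops 1 gw                   10.10.1.52@tcp up
"""

if __name__ == "__main__":
    for net, gw in down_routes(sample):
        print(f"{net}: gateway {gw} is down")
```
    </pre>
    The text fed to it would be the captured output of lctl show_route
    from each server, collected however you normally run commands across
    the machines.<br>
    <br>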

    <div class="moz-cite-prefix">On 9/5/2013 4:01 PM, Kris Howard wrote:<br>

    </div>

    <blockquote

cite="mid:CAFrN90EOihnvp3kGdx9FTf0iRv3tmNHpQHHr=+TrwyksJ_jMPg@mail.gmail.com"

      type="cite">

      <div dir="ltr">Might check lctl show_route and look for downed

        routes.</div>

      <div class="gmail_extra"><br>

        <br>

        <div class="gmail_quote">On Thu, Sep 5, 2013 at 12:56 PM, Bob

          Ball <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:ball@umich.edu" target="_blank">ball@umich.edu</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">We are

            running Lustre 2.1.6 on Scientific Linux 6.4, kernel

            2.6.32-358.11.1.el6.x86_64.  This was an upgrade from Lustre

            1.8.4 on SL5.<br>

            <br>

            We have had a few situations lately where a client stops
            talking to some subset of the OSTs (about 58 in total
            across 8 OSSes, nearly 500TB).  I have a couple of
            questions.<br>

            <br>

            1. "lctl dl" on the OSS shows a smaller count on the
            affected servers; on the client, all OSSes showed UP in
            "lctl dl".  Today, I first tried rebooting the affected OSS,
            but that did not change the situation.  I ended up rebooting
            the client before I could get full connectivity.  Is there
            any way to force the reconnect from the client, short of
            rebooting that client?<br>

            <br>

            2. Under Lustre 1.8.4, I could run "lfs df -h" on the
            client and see all OSTs, even those where the connection
            was not working, for whatever reason.  That is no longer
            the case; the lfs command now stops at the first
            non-talking OST.  This seems more like a bug than a
            feature.  Is there some other way to see a list of
            non-communicating OSTs on a client?<br>

            <br>

            Thanks in advance for any help offered.<br>

            <br>

            bob<br>

            <br>


          </blockquote>

        </div>

        <br>

      </div>

    </blockquote>

    <br>

  </body>

</html>