<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    There are very few clients right now on the tcp2 network.  AFAIK,

    they are happy.  No one is complaining, but they may not be using

    the Lustre storage right now, either.<br>

    <br>

    bob<br>

    <br>

    <div class="moz-cite-prefix">On 9/5/2013 4:22 PM, Kris Howard wrote:<br>

    </div>

    <blockquote

cite="mid:CAFrN90Gsv6jO_yO4vTv0LTyyLbMHX8aS7=BqYkhPzH3tp-r-bw@mail.gmail.com"

      type="cite">

      <div dir="ltr">If you run lctl ping 10.10.X.XX@tcp from both sides,
        it should bring the route up.
        <div>With all of those routes down, everything is still happy?</div>

        <div>hmm.</div>

      </div>

      <div class="gmail_extra"><br>

        <br>

        <div class="gmail_quote">

          On Thu, Sep 5, 2013 at 1:14 PM, Bob Ball <span dir="ltr">&lt;<a
              moz-do-not-send="true" href="mailto:ball@umich.edu"
              target="_blank">ball@umich.edu</a>&gt;</span> wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> That is an

              interesting mix.  Nothing shows up at all on the clients,

              even on those 3 that route to a second NIC.  On the OSS,

              it is quite the mix of up/down on the 3 routers, with no

              obvious pattern.<br>

              <br>

              Most of our traffic is on the 10.10 network, with the 3

              machines shown below routing to a small number of clients

              on a more public network.<br>

              <br>

              FYI, the current situation is one in which all machines

              are happy, as far as I can tell.<br>

              <br>

              bob<br>

              <br>

              Running lctl show_route on all machines in lustre_fss.txt<br>

              On umdist05.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

                       Succeeded<br>

              On umfs06.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp up<br>

                       Succeeded<br>

              On umdist01.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

                       Succeeded<br>

              On umdist02.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

                       Succeeded<br>

              On umdist03.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp up<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

                       Succeeded<br>

              On umdist04.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp up<br>

                       Succeeded<br>

              On umdist07.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

                       Succeeded<br>

              On umdist08.local<br>

              net               tcp2 hops 1 gw                  

              10.10.1.50@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.52@tcp down<br>

              net               tcp2 hops 1 gw                  

              10.10.1.51@tcp down<br>

                       Succeeded
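              <br>
              <br>
              For anyone who wants to summarize a listing like the one
              above, here is a minimal sketch in Python. It is only an
              illustration: the function name find_down_routes is
              hypothetical, and the line format it expects ("net NET hops
              N gw NID up|down", one route per line, preceded by an "On
              HOST" line) is an assumption taken from the output above.
              It parses text that has already been captured and does not
              call lctl itself.<br>
<pre>
```python
import re
from collections import defaultdict

# Assumed line shape, inferred only from the listing above:
#   net tcp2 hops 1 gw 10.10.1.52@tcp down
ROUTE_RE = re.compile(r"net\s+(\S+)\s+hops\s+(\d+)\s+gw\s+(\S+)\s+(up|down)")
# Assumed host header, as printed by the wrapper script: "On HOST"
HOST_RE = re.compile(r"^On\s+(\S+)\s*$")


def find_down_routes(text):
    """Return {host: [gateway NIDs reported down]} from captured output."""
    down = defaultdict(list)
    host = None
    for line in text.splitlines():
        line = line.strip()
        m = HOST_RE.match(line)
        if m:
            host = m.group(1)
            down[host]  # register the host even if no route is down
            continue
        m = ROUTE_RE.search(line)
        if m and m.group(4) == "down":
            down[host].append(m.group(3))
    return dict(down)


if __name__ == "__main__":
    sample = """\
On umfs06.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
net tcp2 hops 1 gw 10.10.1.52@tcp up
         Succeeded
"""
    for h, gws in find_down_routes(sample).items():
        print(h, "down:", ", ".join(gws) or "none")
```
</pre>
              Hosts with an empty list have every listed gateway up; by
              the listing above, umdist07 and umdist08 would show all
              three gateways down.<br>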

              <div>

                <div class="h5"><br>

                  <br>

                  <div>On 9/5/2013 4:01 PM, Kris Howard wrote:<br>

                  </div>

                  <blockquote type="cite">

                    <div dir="ltr">Might check lctl show_route and look

                      for downed routes.</div>

                    <div class="gmail_extra"><br>

                      <br>

                      <div class="gmail_quote">On Thu, Sep 5, 2013 at

                        12:56 PM, Bob Ball <span dir="ltr">&lt;<a
                            moz-do-not-send="true"
                            href="mailto:ball@umich.edu" target="_blank">ball@umich.edu</a>&gt;</span>

                        wrote:<br>

                        <blockquote class="gmail_quote" style="margin:0

                          0 0 .8ex;border-left:1px #ccc

                          solid;padding-left:1ex">We are running Lustre

                          2.1.6 on Scientific Linux 6.4, kernel

                          2.6.32-358.11.1.el6.x86_64.  This was an

                          upgrade from Lustre 1.8.4 on SL5.<br>

                          <br>

                          We have had a few situations lately where a
                          client stops talking to some subset of the OSTs
                          (about 58 in total across 8 OSSes, nearly
                          500 TB altogether).  I have a couple of
                          questions.<br>

                          <br>

                          1. "lctl dl" on the OSS shows a smaller count
                          on the affected servers; on the client, all
                          OSS show UP in "lctl dl".  Today, I first
                          tried rebooting the affected OSS, but that did
                          not change the situation.  I ended up having to
                          reboot the client before I could get full
                          connectivity.  Is there any way to force the
                          reconnect from the client, short of rebooting
                          it?<br>

                          <br>

                          2. Under Lustre 1.8.4, I could run "lfs df -h"
                          on the client and see all OSTs, even those
                          where the connection was not working, for
                          whatever reason.  That is no longer the case;
                          the lfs command now stops at the first
                          non-talking OST.  This seems more like a bug
                          than a feature.  Is there some other way to see
                          a list of non-communicating OSTs on a client?<br>

                          <br>

                          Thanks in advance for any help offered.<br>

                          <br>

                          bob<br>

                          <br>

                          <br>

                          <br>

_______________________________________________<br>

                          HPDD-discuss mailing list<br>

                          <a moz-do-not-send="true"

                            href="mailto:HPDD-discuss@lists.01.org"

                            target="_blank">HPDD-discuss@lists.01.org</a><br>

                          <a moz-do-not-send="true"

                            href="https://lists.01.org/mailman/listinfo/hpdd-discuss"

                            target="_blank">https://lists.01.org/mailman/listinfo/hpdd-discuss</a><br>

                        </blockquote>

                      </div>

                      <br>

                    </div>

                  </blockquote>

                  <br>

                </div>

              </div>

            </div>

          </blockquote>

        </div>

        <br>

      </div>

    </blockquote>

    <br>

  </body>

</html>