<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>Is this patch dependent on any others for 2.8.0?  It didn't apply
      cleanly, but I believe I got it in without problems.  It still fails
      outright with the same errors, and wrq_sge=2 is set.  <br>
    </p>
    <p>I'm pretty sure I had this one applied when I was trying to route
      between OPA and mlx5 and saw the same behavior there, but I haven't
      gotten back to that effort yet.<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 11/01/2016 08:18 PM, Brian W.
      Johanson wrote:<br>
    </div>
    <blockquote cite="mid:85260553-b59f-e26b-451d-6096ef46667d@psc.edu"
      type="cite">
      <meta content="text/html; charset=windows-1252"
        http-equiv="Content-Type">
      <p>Great, thanks Doug! <br>
      </p>
      <p>Quotas are not enabled.  <br>
      </p>
      <p>There are a few nodes that were exhibiting the issue fairly
        consistently.  We have recently added 70 clients (~900 total),
        which seems to have caused this to happen more frequently.</p>
      <p>-b<br>
      </p>
      <br>
      <div class="moz-cite-prefix">On 11/01/2016 07:57 PM, Oucharek,
        Doug S wrote:<br>
      </div>
      <blockquote
        cite="mid:4E4BA0EB-B42D-4CAD-9ADC-2D84CD98B137@intel.com"
        type="cite">
        <meta http-equiv="Content-Type" content="text/html;
          charset=windows-1252">
        Hi Brian,
        <div class=""><br class="">
        </div>
        <div class="">You need this patch: <a moz-do-not-send="true"
            href="http://review.whamcloud.com/#/c/12451" class="">http://review.whamcloud.com/#/c/12451</a>.
           It has not landed on master yet and is off by default.  To
          activate it, add this module parameter line to your nodes (all
          of them):</div>
        <div class=""><br class="">
        </div>
        <div class=""><span style="color: rgb(51, 51, 51); font-family:
            Arial, sans-serif; font-size: 14px; background-color:
            rgb(245, 245, 245);" class="">options ko2iblnd wrq_sge=2</span></div>
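        <div class="">For reference, that option is typically made
          persistent with a modprobe.d fragment (a sketch; the file name
          below is arbitrary, and the ko2iblnd module must be reloaded,
          i.e. Lustre/LNet stopped and restarted, before it takes
          effect):</div>

```
# /etc/modprobe.d/ko2iblnd.conf  (any *.conf file under modprobe.d works)
options ko2iblnd wrq_sge=2
```

        <div class="">Once the module is reloaded, the active value
          should be readable from
          /sys/module/ko2iblnd/parameters/wrq_sge, assuming the
          parameter is exported readable there.</div>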
        <div class=""><br class="">
        </div>
        <div class="">The issue is that something is introducing an
          offset into the bulk transfers.  That misaligns the source and
          destination fragments, and due to how the algorithm works, you
          then need twice as many descriptors as fragments to do the
          RDMA operation.  So you run out of descriptors when you are
          only halfway done configuring the transfer.  The above patch
          creates two sets of descriptors so the second set can be used
          in situations like this.  The fix operates on the nodes doing
          the bulk transfers; since you can both read and write bulk
          data, you need it on servers, clients, and LNet routers
          (basically, everywhere).</div>
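        <div class="">The descriptor blow-up described above can be
          sketched with a toy counter: when the source and destination
          offsets line up, each page of a transfer consumes one
          descriptor, but a misaligned offset makes nearly every page
          straddle a boundary on one side and split in two.  The page
          size and the cap of 32 (matching the "(32)" in the client log)
          are assumptions of this sketch, not taken from the ko2iblnd
          source:</div>

```python
PAGE = 4096        # assumed page size for this toy model
MAX_FRAGS = 32     # matches the "(32)" cap seen in the client log

def frags_needed(length, src_off, dst_off, page=PAGE):
    """Count descriptors in a simplified model where every page
    boundary crossed on either the source or destination side
    starts a new fragment."""
    frags, pos = 0, 0
    while pos < length:
        src_room = page - (src_off + pos) % page   # bytes left in src page
        dst_room = page - (dst_off + pos) % page   # bytes left in dst page
        pos += min(src_room, dst_room, length - pos)
        frags += 1
    return frags

aligned = frags_needed(16 * PAGE, 0, 0)     # 16: one descriptor per page
shifted = frags_needed(16 * PAGE, 0, 512)   # 32: every page splits in two
```

        <div class="">In this model a 16-page transfer needs 16
          descriptors when aligned but 32 when only the destination is
          shifted by 512 bytes, which is the halfway-through exhaustion
          described above.</div>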
        <div class=""><br class="">
        </div>
        <div class="">Question: are you using the quotas feature and
          could it be at or approaching a limit?  There has been some
          evidence that the quotas feature could be introducing the
          offset to bulk transfers.</div>
        <div class=""><br class="">
        </div>
        <div class="">Doug</div>
        <div class=""><br class="">
          <div>
            <blockquote type="cite" class="">
              <div class="">On Nov 1, 2016, at 4:08 PM, Brian W.
                Johanson <<a moz-do-not-send="true"
                  href="mailto:bjohanso@psc.edu" class="">bjohanso@psc.edu</a>>
                wrote:</div>
              <br class="Apple-interchange-newline">
              <div class="">
                <div bgcolor="#FFFFFF" text="#000000" class="">Centos
                  7.2<br class="">
                  Lustre 2.8.0<br class="">
                  ZFS 0.6.5.5<br class="">
                  OPA <span class="version"><span class="value">10.2.0.0.158<br
                        class="">
                      <br class="">
                      <br class="">
                      The clients and servers are on the same OPA
                      network, no routing.  Once a client gets into this
                      state, filesystem performance drops to a fraction
                      of what it is capable of.<br class="">
                      The client must be rebooted to clear the issue.<br
                        class="">
                      <br class="">
                    </span></span>
                  <p class="">I imagine there is already a bug in Jira
                    for this that I am missing; does this look like a
                    known issue?<br class="">
                  </p>
                  <p class=""><br class="">
                  </p>
                  <p class="">Pertinent debug messages from the server:<br
                      class="">
                  </p>
                  <p class="">00000800:00020000:34.0:1478026118.277782:0:29892:0:(o2iblnd_cb.c:3109:kiblnd_check_txs_locked())
                    Timed out tx: active_txs, 4 seconds<br class="">
00000800:00020000:34.0:1478026118.277785:0:29892:0:(o2iblnd_cb.c:3172:kiblnd_check_conns())
                    Timed out RDMA with 10.4.119.112@o2ib (3): c: 112,
                    oc: 0, rc: 66<br class="">
00000800:00000100:34.0:1478026118.277787:0:29892:0:(o2iblnd_cb.c:1913:kiblnd_close_conn_locked())
                    Closing conn to 10.4.119.112@o2ib: error
                    -110(waiting)<br class="">
00000100:00020000:34.0:1478026118.277844:0:29892:0:(events.c:447:server_bulk_callback())
                    event type 5, status -103, desc ffff883e8e8bcc00<br
                      class="">
00000100:00020000:34.0:1478026118.288714:0:29892:0:(events.c:447:server_bulk_callback())
                    event type 3, status -103, desc ffff883e8e8bcc00<br
                      class="">
00000100:00020000:34.0:1478026118.299574:0:29892:0:(events.c:447:server_bulk_callback())
                    event type 5, status -103, desc ffff8810e92e9c00<br
                      class="">
00000100:00020000:34.0:1478026118.310434:0:29892:0:(events.c:447:server_bulk_callback())
                    event type 3, status -103, desc ffff8810e92e9c00</p>
                  <br class="">
                  And from the client:<br class="">
                  <br class="">
                  <p class="">00000400:00000100:8.0:1477949860.565777:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949860.565782:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000800:00020000:8.0:1477949860.702666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
                    RDMA has too many fragments for peer
                    10.4.108.81@o2ib (32), src idx/frags: 16/27 dst
                    idx/frags: 16/27<br class="">
00000800:00020000:8.0:1477949860.702667:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
                    Can't setup rdma for GET from 10.4.108.81@o2ib: -90<br
                      class="">
00000100:00020000:8.0:1477949860.702669:0:3629:0:(events.c:201:client_bulk_callback())
                    event type 1, status -5, desc ffff880fd5d9bc00<br
                      class="">
00000800:00020000:8.0:1477949860.816666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
                    RDMA has too many fragments for peer
                    10.4.108.81@o2ib (32), src idx/frags: 16/27 dst
                    idx/frags: 16/27<br class="">
00000800:00020000:8.0:1477949860.816668:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
                    Can't setup rdma for GET from 10.4.108.81@o2ib: -90<br
                      class="">
00000100:00020000:8.0:1477949860.816669:0:3629:0:(events.c:201:client_bulk_callback())
                    event type 1, status -5, desc ffff880fd5d9bc00<br
                      class="">
00000400:00000100:8.0:1477949861.573660:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573664:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573667:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573669:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573671:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573673:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573675:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000400:00000100:8.0:1477949861.573677:0:3629:0:(lib-move.c:1489:lnet_parse_put())
                    Dropping PUT from 12345-10.4.108.81@o2ib portal 4
                    match 1549728742532740 offset 0 length 192: 4<br
                      class="">
00000800:00020000:8.0:1477949861.721668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
                    RDMA has too many fragments for peer
                    10.4.108.81@o2ib (32), src idx/frags: 16/27 dst
                    idx/frags: 16/27<br class="">
00000800:00020000:8.0:1477949861.721669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
                    Can't setup rdma for GET from 10.4.108.81@o2ib: -90<br
                      class="">
00000100:00020000:8.0:1477949861.721670:0:3629:0:(events.c:201:client_bulk_callback())
                    event type 1, status -5, desc ffff880fd5d9bc00<br
                      class="">
00000800:00020000:8.0:1477949861.836668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
                    RDMA has too many fragments for peer
                    10.4.108.81@o2ib (32), src idx/frags: 16/27 dst
                    idx/frags: 16/27<br class="">
00000800:00020000:8.0:1477949861.836669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
                    Can't setup rdma for GET from 10.4.108.81@o2ib: -90<br
                      class="">
00000100:00020000:8.0:1477949861.836670:0:3629:0:(events.c:201:client_bulk_callback())
                    event type 1, status -5, desc ffff880fd5d9bc00<br
                      class="">
00000800:00020000:8.0:1477949862.061668:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
                    RDMA has too many fragments for peer
                    10.4.108.81@o2ib (32), src idx/frags: 16/27 dst
                    idx/frags: 16/27<br class="">
00000800:00020000:8.0:1477949862.061669:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
                    Can't setup rdma for GET from 10.4.108.81@o2ib: -90<br
                      class="">
00000100:00020000:8.0:1477949862.061670:0:3629:0:(events.c:201:client_bulk_callback())
                    event type 1, status -5, desc ffff880fd5d9bc00<br
                      class="">
                    <br class="">
                  </p>
                </div>
                _______________________________________________<br
                  class="">
                lustre-discuss mailing list<br class="">
                <a moz-do-not-send="true"
                  href="mailto:lustre-discuss@lists.lustre.org" class="">lustre-discuss@lists.lustre.org</a><br
                  class="">
                <a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br
                  class="">
              </div>
            </blockquote>
          </div>
          <br class="">
        </div>
      </blockquote>
      <br>
    </blockquote>
    <br>
  </body>
</html>