<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 10-12-2017 06:07, Dilger, Andreas
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:36F4978F-D81F-45FF-9B69-34CED1042436@intel.com">
      Based on the messages on the client, this isn’t related to mmap()
      or writes done by the client, since the data has the same checksum
      before it was sent and after the server returned the checksum
      error. That means the pages did not change on the client.
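      <div>As a rough illustration of that reasoning (a sketch only, not
        the Lustre implementation): <code>classify_mismatch</code> is a
        hypothetical helper and <code>zlib.adler32</code> is just a
        stand-in checksum. If the checksum recomputed on the client
        still matches what was sent, the pages did not change on the
        client.</div>
<pre>
```python
import zlib


def classify_mismatch(sent_checksum, client_pages_now, server_checksum):
    """Guess where the data changed (hypothetical helper, illustration only)."""
    if server_checksum == sent_checksum:
        return "ok"
    # The server reported a mismatch: recompute over the pages the client still holds.
    if zlib.adler32(client_pages_now) == sent_checksum:
        # Client pages are unchanged, so the data changed after leaving the client.
        return "network-or-server"
    # Client pages no longer match the checksum that was sent (e.g. mmap IO).
    return "client"


original = b"A" * 4096
sent = zlib.adler32(original)

# Case 1: one byte is altered somewhere between client memory and the server.
received = b"B" + original[1:]
print(classify_mismatch(sent, original, zlib.adler32(received)))

# Case 2: the page is modified on the client after the checksum was computed.
modified = b"C" + original[1:]
print(classify_mismatch(sent, modified, zlib.adler32(modified)))
```
</pre>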
      <div><br>
      </div>
      <div>Possible causes include the client network card, server
        network card, memory, or possibly the OFED driver?  It could of
        course be something in Lustre/LNet, though we haven’t had any
        reports of anything similar. </div>
      <div><br>
      </div>
      <div>When the checksum code was first written, it was motivated by
        a faulty Ethernet NIC that had TCP checksum offload but bad
        onboard cache. The data was corrupted when copied onto the NIC,
        but the TCP checksum was computed on the bad data, so the
        checksum was “correct” when received by the server and it didn’t
        cause TCP resends. </div>
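      <div>A small sketch of why that case slips past the transport
        checksum, under stated assumptions: <code>nic_send</code> is a
        hypothetical model of a NIC with checksum offload, and
        <code>zlib.crc32</code> stands in for both the TCP checksum and
        the end-to-end checksum (the real TCP checksum is a different
        algorithm).</div>
<pre>
```python
import zlib


def nic_send(payload, corrupt):
    """Hypothetical NIC with checksum offload: the checksum is computed on the NIC's copy."""
    on_nic = bytearray(payload)
    if corrupt:
        on_nic[0] ^= 0xFF  # bad onboard cache damages the data during the copy
    wire_checksum = zlib.crc32(bytes(on_nic))  # stand-in for the offloaded TCP checksum
    return bytes(on_nic), wire_checksum


payload = b"lustre bulk data"
end_to_end = zlib.crc32(payload)  # computed by the host before the data reaches the NIC

received, wire_checksum = nic_send(payload, corrupt=True)

# True: the wire checksum was computed on the already-corrupted data, so no resend.
transport_ok = zlib.crc32(received) == wire_checksum
# False: the host-side checksum no longer matches what arrived.
end_to_end_ok = zlib.crc32(received) == end_to_end

print(transport_ok, end_to_end_ok)
```
</pre>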
      <div><br>
      </div>
      <div>Are you seeing this on multiple servers?  The client log only
        shows one server, while the server log shows multiple clients.
         If it is only happening on one server it might point to
        hardware. </div>
    </blockquote>
    Yes, we are seeing it on all servers.<br>
    <blockquote type="cite"
      cite="mid:36F4978F-D81F-45FF-9B69-34CED1042436@intel.com">
      <div>Did you also upgrade the kernel and OFED at the same time as
        Lustre? You could try building Lustre 2.10.1 on the old 2.9.0
        kernel and OFED to see if that works properly. <br>
      </div>
    </blockquote>
    We upgraded to CentOS 7.4 and are using the included OFED on the
    servers. Also, we upgraded the firmware on the server IB cards. We
    will check further if this combination has compatibility issues.<br>
    <br>
    Cheers,<br>
    Hans Henrik<br>
    <blockquote type="cite"
      cite="mid:36F4978F-D81F-45FF-9B69-34CED1042436@intel.com">
      <div>
        <br>
        <div id="AppleMailSignature">Cheers, Andreas</div>
        <div><br>
          On Dec 9, 2017, at 11:09, Hans Henrik Happe &lt;<a
            href="mailto:happe@nbi.dk" moz-do-not-send="true">happe@nbi.dk</a>&gt;
          wrote:<br>
          <br>
        </div>
        <blockquote type="cite">
          <div><span></span><br>
            <span></span><br>
            <span>On 09-12-2017 18:57, Hans Henrik Happe wrote:</span><br>
            <blockquote type="cite"><span>On 07-12-2017 21:36, Dilger,
                Andreas wrote:</span><br>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span>On Dec 7, 2017, at 10:37,
                  Hans Henrik Happe &lt;<a href="mailto:happe@nbi.dk"
                    moz-do-not-send="true">happe@nbi.dk</a>&gt; wrote:</span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>Hi,</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span></span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>Can an application cause
                    BAD CHECKSUM errors in Lustre logs by somehow</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>overwriting memory while
                    being DMA'ed to network?</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span></span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>After upgrading to 2.10.1
                    on the server side we started seeing this from</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>a user's application (MPI
                    I/O). Both 2.9.0 and 2.10.1 clients emit these</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>errors. We have not yet
                    established whether the application is doing</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite">
                <blockquote type="cite"><span>things correctly.</span><br>
                </blockquote>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span>If applications are using
                  mmap IO it is possible for the page to become
                  inconsistent after the checksum has been computed.
                   However, mmap IO is</span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span>normally detected by the
                  client and no message should be printed.</span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span></span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span>There isn't anything that
                  the application needs to do, since the client will
                  resend the data if there is a checksum error, but the
                  resends do slow down the IO.  If the inconsistency is
                  on the client, there is no cause for concern (though
                  it would be good to figure out the root cause).</span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span></span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite">
              <blockquote type="cite"><span>It would be interesting to
                  see what the exact error message is, since that will
                  say whether the data became inconsistent on the
                  client, or over the network.  If the inconsistency is
                  over the network or on the server, then that may point
                  to hardware issues.</span><br>
              </blockquote>
            </blockquote>
            <blockquote type="cite"><span>I've attached logs from a
                server and a client.</span><br>
            </blockquote>
            <span></span><br>
            <span>There was a cut n' paste error in the first set of
              files. This should be</span><br>
            <span>better.</span><br>
            <span></span><br>
            <span>Looks like something goes wrong over the network.</span><br>
            <span></span><br>
            <span>Cheers,</span><br>
            <span>Hans Henrik</span><br>
            <span></span><br>
          </div>
        </blockquote>
        <blockquote type="cite">
          <div>&lt;client.log&gt;</div>
        </blockquote>
        <blockquote type="cite">
          <div>&lt;server.log&gt;</div>
        </blockquote>
        <blockquote type="cite">
          <div><span>_______________________________________________</span><br>
            <span>lustre-discuss mailing list</span><br>
            <span><a href="mailto:lustre-discuss@lists.lustre.org"
                moz-do-not-send="true">lustre-discuss@lists.lustre.org</a></span><br>
            <span><a
                href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org"
                moz-do-not-send="true">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a></span><br>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>