<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hello,<br>
Are you using InfiniBand?<br>
If so, what are the peer credit settings?<br>
<br>
cat /proc/sys/lnet/nis<br>
cat /proc/sys/lnet/peers<br>
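On 2.10 the same counters are also exposed through lnetctl; a sketch, assuming the o2iblnd driver (exact fields vary by version and LND):<br>
<br>
lnetctl net show -v&nbsp;&nbsp;&nbsp;# per-NI tunables, including peer_credits and credits<br>
lnetctl peer show -v&nbsp;&nbsp;# per-peer credit state, where your version supports it<br>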
<br>
<br>
On 12/3/17 8:38 AM, E.S. Rosenberg wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+K1OzT7NQ6JUNOgSyKfNQtqioy0bUs7YMwTNscrV8rcORnVog@mail.gmail.com">
<div dir="ltr">Did you find the problem? Were there any useful
suggestions off-list?<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Nov 29, 2017 at 1:34 PM,
Charles A Taylor <span dir="ltr"><<a
href="mailto:chasman@ufl.edu" target="_blank"
moz-do-not-send="true">chasman@ufl.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
We have a genomics pipeline app (supernova) that fails
consistently due to the client being evicted on the OSSs
with a “lock callback timer expired”. I doubled
“ldlm_enqueue_min” across the cluster, but then the timer
simply expired after 200s rather than 100s, so I don’t think
that is the answer. The syslog/dmesg on the client shows
no signs of distress and it is a “bigmem” machine with 1TB
of RAM.<br>
<br>
The eviction appears to come while the application is
processing a large number (~300) of data “chunks” (i.e.
files) which occur in pairs.<br>
<br>
-rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs<br>
-rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm<br>
<br>
I assume the 24-byte file is metadata (an index or some
such) and the 34M file is the actual data but I’m just
guessing since I’m completely unfamiliar with the
application.<br>
<br>
The write error is,<br>
<br>
#define ENOTCONN 107 /* Transport endpoint is
not connected */<br>
<br>
which occurs after the OSS eviction. This was reproducible
under 2.5.3.90 as well. We hoped that upgrading to 2.10.1
would resolve the issue but it has not.<br>
<br>
This is the first application (in 10 years) we have
encountered that consistently and reliably fails when run
over Lustre. I’m not sure at this point whether this is a
bug or a tuning issue.<br>
If others have encountered and overcome something like this,
we’d be grateful to hear from you.<br>
<br>
Regards,<br>
<br>
Charles Taylor<br>
UF Research Computing<br>
<br>
OSS:<br>
--------------<br>
Nov 28 23:41:41 ufrcoss28 kernel: LustreError:
0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ###
lock callback timer expired after 201s: evicting client at
10.13.136.74@o2ib ns: filter-testfs-OST002e_UUID lock:
ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode:
PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 4096->1802239) flags:
0x60000400010020 nid: 10.13.136.74@o2ib remote:
0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506
lvb_type: 0<br>
<br>
Client:<br>
———————<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
testfs-OST002e-osc-ffff88c053fe3800: operation
ost_write to node 10.13.136.30@o2ib failed: rc = -107<br>
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800:
Connection to testfs-OST002e (at 10.13.136.30@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
testfs-OST002e-osc-ffff88c053fe3800: This client was
evicted by testfs-OST002e; in progress operations using this
service will fail.<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
testfs-OST002c-osc-ffff88c053fe3800: operation
ost_punch to node 10.13.136.30@o2ib failed: rc = -107<br>
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
Connection to testfs-OST002c (at 10.13.136.30@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
testfs-OST002c-osc-ffff88c053fe3800: This client was
evicted by testfs-OST002c; in progress operations using this
service will fail.<br>
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0:
testfs-OST0000-osc-ffff88c053fe3800: operation
ost_statfs to node 10.13.136.23@o2ib failed: rc = -107<br>
Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
Connection to testfs-OST0000 (at 10.13.136.23@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0:
testfs-OST0004-osc-ffff88c053fe3800: This client was
evicted by testfs-OST0004; in progress operations using this
service will fail.<br>
Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800:
Connection restored to 10.13.136.24@o2ib (at
10.13.136.24@o2ib)<br>
Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
Connection restored to 10.13.136.30@o2ib (at
10.13.136.30@o2ib)<br>
Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
Connection restored to 10.13.136.23@o2ib (at
10.13.136.23@o2ib)<br>
Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800:
Connection restored to 10.13.136.23@o2ib (at
10.13.136.23@o2ib)<br>
Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous
similar messages<br>
Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800:
Connection restored to 10.13.136.24@o2ib (at
10.13.136.24@o2ib)<br>
Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous
similar messages<br>
<br>
Some Details:<br>
-------------------<br>
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc
3.10.0-693.2.2.el7_lustre.x86_64)<br>
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)<br>
Client: 2.10.1<br>
1 TB RAM<br>
Mellanox ConnectX-3 IB/VPI HCAs<br>
Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64<br>
MOFED 3.2.2 IB stack<br>
Lustre 2.10.1<br>
Servers: 10 HA OSS pairs (20 OSSs)<br>
128 GB RAM<br>
6 OSTs (8+2 RAID-6) per OSS<br>
Mellanox ConnectX-3 IB/VPI HCAs<br>
RedHat EL7 Native IB Stack (i.e. not MOFED)<br>
<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
<a href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true">lustre-discuss@lists.lustre.org</a><br>
<a
href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<p><br>
</p>
</body>
</html>