[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

Riccardo Veraldi riccardo.veraldi at gmail.com
Sun Dec 3 08:55:47 PST 2017


Hello,
are you using InfiniBand?
If so, what are the peer credit settings? You can check with:

 cat /proc/sys/lnet/nis
 cat /proc/sys/lnet/peers
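
As a point of comparison, here is a minimal sketch of where the o2ib
credits are usually set (the values below are only illustrative, not a
recommendation for your site):

 # /etc/modprobe.d/ko2iblnd.conf -- load-time options for the o2ib LND;
 # peer_credits should normally match between clients and servers
 options ko2iblnd peer_credits=63 concurrent_sends=63

 # after reloading the lnet/ko2iblnd modules, check what was negotiated
 cat /proc/sys/lnet/nis     # per-interface credit limits
 cat /proc/sys/lnet/peers   # per-peer credits and queue depths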


On 12/3/17 8:38 AM, E.S. Rosenberg wrote:
> Did you find the problem? Were there any useful suggestions off-list?
>
> On Wed, Nov 29, 2017 at 1:34 PM, Charles A Taylor <chasman at ufl.edu> wrote:
>
>
>     We have a genomics pipeline application (supernova) that fails
>     consistently because the client is evicted on the OSSs with a
>     “lock callback timer expired” error.  I doubled “ldlm_enqueue_min”
>     across the cluster, but then the timer simply expired after 200s
>     rather than 100s, so I don’t think that is the answer.  The
>     syslog/dmesg on the client shows no signs of distress, and it is a
>     “bigmem” machine with 1 TB of RAM.
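>
>     (For reference, a sketch of the timeout tunables involved; the
>     parameter paths below are what I believe they are on 2.10 and may
>     differ on other versions, so please verify on your systems:
>
>         # minimum lock enqueue/callback time, in seconds (default 100)
>         cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min
>
>         # adaptive timeout bounds, also ptlrpc module parameters
>         cat /sys/module/ptlrpc/parameters/at_min
>         cat /sys/module/ptlrpc/parameters/at_max
>
>         # persistent change, e.g. in /etc/modprobe.d/lustre.conf:
>         #   options ptlrpc ldlm_enqueue_min=200
>     )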
>
>     The eviction appears to come while the application is processing a
>     large number (~300) of data “chunks” (i.e. files) which occur in
>     pairs.
>
>     -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31
>     ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
>     -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31
>     ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
>
>     I assume the 24-byte file is metadata (an index or some such) and
>     the 34M file is the actual data but I’m just guessing since I’m
>     completely unfamiliar with the application.
>
>     The write error returned to the application is
>
>         #define ENOTCONN   107   /* Transport endpoint is not connected */
>
>     which occurs after the OSS eviction.  This was reproducible under
>     2.5.3.90 as well.  We had hoped that upgrading to 2.10.1 would
>     resolve the issue, but it has not.
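>
>     (The "rc = -107" in the client log below is the same errno; a
>     quick way to confirm the mapping, assuming python is available:
>
>         python -c 'import os; print(os.strerror(107))'
>         # Transport endpoint is not connected
>     )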
>
>     This is the first application (in 10 years) we have encountered
>     that consistently and reliably fails when run over Lustre.  At
>     this point I am not sure whether this is a bug or a tuning issue.
>     If others have encountered and overcome something like this, we’d
>     be grateful to hear from you.
>
>     Regards,
>
>     Charles Taylor
>     UF Research Computing
>
>     OSS:
>     --------------
>     Nov 28 23:41:41 ufrcoss28 kernel: LustreError:
>     0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback
>     timer expired after 201s: evicting client at 10.13.136.74 at o2ib  ns:
>     filter-testfs-OST002e_UUID lock:
>     ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res:
>     [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
>     (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74 at o2ib
>     remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout:
>     6488120506 lvb_type: 0
>
>     Client:
>     ———————
>     Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
>     testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node
>     10.13.136.30 at o2ib failed: rc = -107
>     Nov 28 23:41:42 s5a-s23 kernel: Lustre:
>     testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e
>     (at 10.13.136.30 at o2ib) was lost; in progress operations using this
>     service will wait for recovery to complete
>     Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
>     testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by
>     testfs-OST002e; in progress operations using this service will fail.
>     Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
>     testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node
>     10.13.136.30 at o2ib failed: rc = -107
>     Nov 28 23:41:42 s5a-s23 kernel: Lustre:
>     testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c
>     (at 10.13.136.30 at o2ib) was lost; in progress operations using this
>     service will wait for recovery to complete
>     Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
>     testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by
>     testfs-OST002c; in progress operations using this service will fail.
>     Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0:
>     testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node
>     10.13.136.23 at o2ib failed: rc = -107
>     Nov 28 23:41:47 s5a-s23 kernel: Lustre:
>     testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000
>     (at 10.13.136.23 at o2ib) was lost; in progress operations using this
>     service will wait for recovery to complete
>     Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0:
>     testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by
>     testfs-OST0004; in progress operations using this service will fail.
>     Nov 28 23:43:11 s5a-s23 kernel: Lustre:
>     testfs-OST0006-osc-ffff88c053fe3800: Connection restored to
>     10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
>     Nov 28 23:43:38 s5a-s23 kernel: Lustre:
>     testfs-OST002c-osc-ffff88c053fe3800: Connection restored to
>     10.13.136.30 at o2ib (at 10.13.136.30 at o2ib)
>     Nov 28 23:43:45 s5a-s23 kernel: Lustre:
>     testfs-OST0000-osc-ffff88c053fe3800: Connection restored to
>     10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
>     Nov 28 23:43:48 s5a-s23 kernel: Lustre:
>     testfs-OST0004-osc-ffff88c053fe3800: Connection restored to
>     10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
>     Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar
>     messages
>     Nov 28 23:43:55 s5a-s23 kernel: Lustre:
>     testfs-OST0007-osc-ffff88c053fe3800: Connection restored to
>     10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
>     Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar
>     messages
>
>     Some Details:
>     -------------------
>     Servers: 10 HA OSS pairs (20 OSSs)
>          OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>          Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>          128 GB RAM
>          6 OSTs (8+2 RAID-6) per OSS
>          Mellanox ConnectX-3 IB/VPI HCAs
>          RedHat EL7 native IB stack (i.e. not MOFED)
>     Client:
>          Lustre 2.10.1
>          Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
>          MOFED 3.2.2 IB stack
>          1 TB RAM
>          Mellanox ConnectX-3 IB/VPI HCAs
>

