[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

Charles A Taylor chasman at ufl.edu
Wed Nov 29 03:34:35 PST 2017


We have a genomics pipeline application (supernova) that fails consistently because the client is evicted by the OSSs with a "lock callback timer expired" error. I doubled "ldlm_enqueue_min" across the cluster, but the timer then simply expired after 200s rather than 100s, so I don't think that is the answer. The syslog/dmesg on the client shows no signs of distress, and it is a "bigmem" machine with 1 TB of RAM.
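
For anyone wanting to poke at the same knob: on 2.10, the enqueue/callback timeout floor is the "ldlm_enqueue_min" module parameter of ptlrpc. A minimal sketch of reading and raising it (assuming the usual sysfs path; verify on your own servers):

    # read the current minimum enqueue/lock-callback timeout, in seconds
    cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min
    # double the 100s default on each server (what we tried cluster-wide)
    echo 200 > /sys/module/ptlrpc/parameters/ldlm_enqueue_min

Raising it only delayed the eviction, which suggests the client is never answering the blocking callback rather than merely answering slowly.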

The eviction appears to come while the application is processing a large number (~300) of data "chunks" (i.e. files), which occur in pairs:

-rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
-rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm

I assume the 24-byte file is metadata (an index or some such) and the 34M file is the actual data, but I'm just guessing since I'm completely unfamiliar with the application.

The write error is,

    #define ENOTCONN        107     /* Transport endpoint is not connected */

which occurs after the OSS eviction. This was reproducible under 2.5.3.90 as well. We hoped that upgrading to 2.10.1 would resolve the issue, but it has not.
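
If you want to confirm that the failure is the eviction surfacing in the application rather than an application bug, the errno can be watched at the syscall boundary while the job runs. A minimal sketch, where <pid> stands in for whichever supernova worker is writing the chunk files:

    # trace the worker's write(2)/pwrite64(2) calls and flag -ENOTCONN failures
    strace -f -e trace=write,pwrite64 -p <pid> 2>&1 | grep ENOTCONN

In our case the writes fail only after the OSS logs the eviction, as the timestamps below show.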

This is the first application we have encountered in 10 years that consistently and reliably fails when run over Lustre. I'm not sure at this point whether this is a bug or a tuning issue.
If others have encountered and overcome something like this, we'd be grateful to hear from you.

Regards,

Charles Taylor
UF Research Computing

OSS:
--------------
Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 201s: evicting client at 10.13.136.74@o2ib  ns: filter-testfs-OST002e_UUID lock: ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74@o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506 lvb_type: 0

Client:
--------------
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node 10.13.136.30@o2ib failed: rc = -107
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e (at 10.13.136.30@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by testfs-OST002e; in progress operations using this service will fail.
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node 10.13.136.30@o2ib failed: rc = -107
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c (at 10.13.136.30@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by testfs-OST002c; in progress operations using this service will fail.
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node 10.13.136.23@o2ib failed: rc = -107
Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000 (at 10.13.136.23@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by testfs-OST0004; in progress operations using this service will fail.
Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800: Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection restored to 10.13.136.30@o2ib (at 10.13.136.30@o2ib)
Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800: Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800: Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages
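
If a reproduction would help anyone dig further, the unanswered callback can be examined with standard lctl debugging on the OSS (a sketch of ordinary lctl usage; nothing here is specific to our setup):

    # on the OSS, before reproducing: add DLM tracing to the debug mask
    lctl set_param debug=+dlmtrace
    # immediately after the eviction: dump the kernel debug buffer to a file
    lctl dk /tmp/oss-eviction-debug.log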

Some Details:
-------------------
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Client: Lustre 2.10.1
     1 TB RAM
     Mellanox ConnectX-3 IB/VPI HCAs
     Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
     MOFED 3.2.2 IB stack
Servers: 10 HA OSS pairs (20 OSSs)
     128 GB RAM
     6 OSTs (8+2 RAID-6) per OSS
     Mellanox ConnectX-3 IB/VPI HCAs
     Red Hat EL7 native IB stack (i.e. not MOFED)
   

