[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

Wed Dec 6 15:13:46 PST 2017

On Nov 29, 2017, at 04:34, Charles A Taylor <chasman at ufl.edu> wrote:
> 
> 
> We have a genomics pipeline app (supernova) that fails consistently due to the client being evicted on the OSSs with a  “lock callback timer expired”.  I doubled “nlm_enqueue_min” across the cluster but then the timer simply expired after 200s rather than 100s so I don’t think that is the answer.   The syslog/dmesg on the client shows no signs of distress and it is a “bigmem” machine with 1TB of RAM.  

Hi Charles,
I haven't seen much action on this email, so I thought I'd ask a few questions to see what is unusual in your system/application that might be causing problems.

The problem appears to be the client being unresponsive to server requests, or possibly a network problem, since the client is complaining about a number of different servers at the same time.

Firstly, is it only the "bigmem" client that is having problems, or does that happen with other clients?  Does this application run on other clients with less RAM?  I'm just wondering if there is something Lustre is doing that takes too long when there is a large amount of RAM (e.g. managing DLM locks, pages in RAM, etc.)?

I see that you are using MOFED on the client, and in-kernel OFED on the servers.  I'm not a network person, but this could be a source of problems at the network level.  You might consider to enable LNet error logging to the console, to see if there are connection problems reported:

    client# lctl set_param printk=+neterror

If this problem is seen on a regular basis, you could try changing a few Lustre tunables to see if that reduces/eliminates the problem.  Firstly, reducing the amount of data and locks that Lustre will cache may help avoid the time it is spending on requests from the servers:

    client# lctl set_param llite.*.max_cached_mb=128G
    client# lctl set_param ldlm.namespaces.*.lru_size=1024

If it is not a network problem, but rather a "client is busy" problem that is helped by reducing the RAM/locks usage, it would be useful to make "perf trace" flame graphs to see where the CPU is used:

    http://www.brendangregg.com/flamegraphs.html

Cheers, Andreas

> The eviction appears to come while the application is processing a large number (~300) of data “chunks” (i.e. files) which occur in pairs.
> 
> -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
> -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
> 
> I assume the 24-byte file is metadata (an index or some such) and the 34M file is the actual data but I’m just guessing since I’m completely unfamiliar with the application.  
> 
> The write error is,
> 
>    #define ENOTCONN        107     /* Transport endpoint is not connected */
> 
> which occurs after the OSS eviction.  This was reproducible under 2.5.3.90 as well.  We hoped that upgrading to 2.10.1 would resolve the issue but it has not.  
> 
> This is the first application (in 10 years) we have encountered that consistently and reliably fails when run over Lustre.  I’m not sure at this point whether this is a bug or tuning issue.
> If others have encountered and overcome something like this, we’d be grateful to hear from you.
> 
> Regards,
> 
> Charles Taylor
> UF Research Computing
> 
> OSS:
> --------------
> Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 201s: evicing client at 10.13.136.74 at o2ib  ns: filter-testfs-OST002e_UUID lock: ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74 at o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506 lvb_type: 0
> 
> Client:
> ———————
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node 10.13.136.30 at o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e (at 10.13.136.30 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by testfs-OST002e; in progress operations using this service will fail.
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node 10.13.136.30 at o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c (at 10.13.136.30 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by testfs-OST002c; in progress operations using this service will fail.
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node 10.13.136.23 at o2ib failed: rc = -107
> Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000 (at 10.13.136.23 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by testfs-OST0004; in progress operations using this service will fail.
> Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800: Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
> Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection restored to 10.13.136.30 at o2ib (at 10.13.136.30 at o2ib)
> Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800: Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
> Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800: Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
> Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages
> 
> Some Details:
> -------------------
> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
> Client: 2.10.1 
>     1 TB RAM
>     Mellanox ConnectX-3 IB/VPI HCAs
>     Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64 
>     MOFED 3.2.2 IB stack
>     Lustre 2.10.1
> Servers: 10 HA OSS pairs (20 OSSs)
>   128 GB RAM
>   6 OSTs (8+2 RAID-6) per OSS 
>   Mellanox ConnectX-3 IB/VPI HCAs 
>   RedHat EL7 Native IB Stack (i.e. not MOFED)

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation