[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Sun Dec 3 08:38:08 PST 2017


Did you find the problem? Were there any useful suggestions off-list?

On Wed, Nov 29, 2017 at 1:34 PM, Charles A Taylor <chasman at ufl.edu> wrote:

>
> We have a genomics pipeline app (supernova) that fails consistently due to
> the client being evicted on the OSSs with a “lock callback timer expired”
> message.  I doubled “ldlm_enqueue_min” across the cluster, but then the
> timer simply expired after 200s rather than 100s, so I don’t think that is
> the answer.  The syslog/dmesg on the client shows no signs of distress, and
> it is a “bigmem” machine with 1 TB of RAM.
>
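> (For completeness: ldlm_enqueue_min is a ptlrpc module option, and something
> like the following is the usual way to set and then verify it; the config
> file name and exact sysfs path may vary between setups and Lustre versions.)
>
>     # e.g. /etc/modprobe.d/lustre.conf on clients and servers; takes
>     # effect after the Lustre modules are reloaded
>     options ptlrpc ldlm_enqueue_min=200
>
>     # check the running value
>     cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min
>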
> The eviction appears to come while the application is processing a large
> number (~300) of data “chunks” (i.e., files), which occur in pairs:
>
> -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
> -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
>
> I assume the 24-byte file is metadata (an index or some such) and the 34M
> file is the actual data, but I’m just guessing since I’m completely
> unfamiliar with the application.
>
> The write error is
>
>     #define ENOTCONN        107     /* Transport endpoint is not connected */
>
> which occurs after the OSS eviction.  This was also reproducible under
> 2.5.3.90.  We hoped that upgrading to 2.10.1 would resolve the issue, but it
> has not.
>
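> (For what it’s worth, the eviction also shows up in the client’s import
> state; roughly, it can be checked with lctl get_param, adjusting the target
> names to your filesystem:)
>
>     # on the client: current import state and recent state history
>     lctl get_param osc.testfs-OST002e*.state
>     lctl get_param osc.testfs-OST002e*.import
>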
> This is the first application (in 10 years) we have encountered that
> consistently and reliably fails when run over Lustre.  I’m not sure at this
> point whether this is a bug or a tuning issue.
> If others have encountered and overcome something like this, we’d be
> grateful to hear from you.
>
> Regards,
>
> Charles Taylor
> UF Research Computing
>
> OSS:
> --------------
> Nov 28 23:41:41 ufrcoss28 kernel: LustreError:
> 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer
> expired after 201s: evicting client at 10.13.136.74 at o2ib  ns:
> filter-testfs-OST002e_UUID lock: ffff880041717400/0x9bd23c8dc69323a1 lrc:
> 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT
> [0->18446744073709551615] (req 4096->1802239) flags: 0x60000400010020 nid:
> 10.13.136.74 at o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836
> timeout: 6488120506 lvb_type: 0
>
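> (If it helps, an LDLM trace captured around the eviction should show which
> lock the OSS was waiting on; a rough sketch of the usual capture procedure:)
>
>     # on the OSS, before reproducing the failure
>     lctl set_param debug=+dlmtrace
>     lctl clear
>     # ...reproduce the eviction, then dump the kernel debug buffer
>     lctl dk /tmp/oss-dlm-trace.log
>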
> Client:
> ———————
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002e-osc-ffff88c053fe3800:
> operation ost_write to node 10.13.136.30 at o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800:
> Connection to testfs-OST002e (at 10.13.136.30 at o2ib) was lost; in progress
> operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002e-osc-ffff88c053fe3800:
> This client was evicted by testfs-OST002e; in progress operations using
> this service will fail.
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002c-osc-ffff88c053fe3800:
> operation ost_punch to node 10.13.136.30 at o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
> Connection to testfs-OST002c (at 10.13.136.30 at o2ib) was lost; in progress
> operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002c-osc-ffff88c053fe3800:
> This client was evicted by testfs-OST002c; in progress operations using
> this service will fail.
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: testfs-OST0000-osc-ffff88c053fe3800:
> operation ost_statfs to node 10.13.136.23 at o2ib failed: rc = -107
> Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
> Connection to testfs-OST0000 (at 10.13.136.23 at o2ib) was lost; in progress
> operations using this service will wait for recovery to complete
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: testfs-OST0004-osc-ffff88c053fe3800:
> This client was evicted by testfs-OST0004; in progress operations using
> this service will fail.
> Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800:
> Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
> Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
> Connection restored to 10.13.136.30 at o2ib (at 10.13.136.30 at o2ib)
> Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
> Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800:
> Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
> Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800:
> Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
> Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages
>
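> (Since a lock callback timer expiring generally means the client did not
> cancel a lock in time, it may also be worth watching how many LDLM locks the
> client is holding while the job runs; something along these lines, with the
> namespace names matching your mounts:)
>
>     # on the client, while supernova is running
>     lctl get_param ldlm.namespaces.*-osc-*.lock_count
>     lctl get_param ldlm.namespaces.*-osc-*.lru_size
>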
> Some Details:
> -------------------
> Server OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
> Server Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
> Client:
>      Lustre 2.10.1
>      1 TB RAM
>      Mellanox ConnectX-3 IB/VPI HCAs
>      Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
>      MOFED 3.2.2 IB stack
> Servers: 10 HA OSS pairs (20 OSSs)
>    128 GB RAM
>    6 OSTs (8+2 RAID-6) per OSS
>    Mellanox ConnectX-3 IB/VPI HCAs
>    RedHat EL7 native IB stack (i.e., not MOFED)
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>