[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired
Riccardo Veraldi
riccardo.veraldi at gmail.com
Sun Dec 3 08:55:47 PST 2017
Hello,
are you using InfiniBand?
if so, what are the peer credit settings?
cat /proc/sys/lnet/nis
cat /proc/sys/lnet/peers
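
As a point of reference (example values only, not a recommendation -- credits
have to be tuned to your fabric and are normally kept consistent across clients
and servers), on o2ib the credits are usually set through ko2iblnd module
options, e.g. in /etc/modprobe.d/ko2iblnd.conf:

    # example only: raise LNet/o2ib credits above the defaults
    options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256

After LNet is reloaded, the new values show up in the credit columns of
/proc/sys/lnet/peers.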
On 12/3/17 8:38 AM, E.S. Rosenberg wrote:
> Did you find the problem? Were there any useful suggestions off-list?
>
> On Wed, Nov 29, 2017 at 1:34 PM, Charles A Taylor <chasman at ufl.edu> wrote:
>
>
> We have a genomics pipeline app (supernova) that fails
> consistently because the client gets evicted on the OSSs with a
> “lock callback timer expired” error. I doubled “ldlm_enqueue_min”
> across the cluster, but then the timer simply expired after 200s
> rather than 100s, so I don’t think that is the answer. The
> syslog/dmesg on the client shows no signs of distress, and it is a
> “bigmem” machine with 1 TB of RAM.
>
> The eviction appears to come while the application is processing a
> large number (~300) of data “chunks” (i.e. files) which occur in
> pairs.
>
> -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31
> ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
> -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31
> ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
>
> I assume the 24-byte file is metadata (an index or some such) and
> the 34M file is the actual data, but I’m just guessing since I’m
> completely unfamiliar with the application.
>
> The write error is
>
> #define ENOTCONN 107 /* Transport endpoint is not connected */
>
> which occurs after the OSS eviction. This was reproducible under
> 2.5.3.90 as well. We hoped that upgrading to 2.10.1 would resolve
> the issue, but it has not.
>
> This is the first application (in 10 years) we have encountered
> that consistently and reliably fails when run over Lustre. I’m
> not sure at this point whether this is a bug or tuning issue.
> If others have encountered and overcome something like this, we’d
> be grateful to hear from you.
>
> Regards,
>
> Charles Taylor
> UF Research Computing
>
> OSS:
> --------------
> Nov 28 23:41:41 ufrcoss28 kernel: LustreError:
> 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback
> timer expired after 201s: evicting client at 10.13.136.74@o2ib ns:
> filter-testfs-OST002e_UUID lock:
> ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res:
> [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
> (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74@o2ib
> remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout:
> 6488120506 lvb_type: 0
>
> Client:
> ———————
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
> testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node
> 10.13.136.30@o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre:
> testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e
> (at 10.13.136.30@o2ib) was lost; in progress operations using this
> service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
> testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by
> testfs-OST002e; in progress operations using this service will fail.
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
> testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node
> 10.13.136.30@o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre:
> testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c
> (at 10.13.136.30@o2ib) was lost; in progress operations using this
> service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
> testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by
> testfs-OST002c; in progress operations using this service will fail.
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0:
> testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node
> 10.13.136.23@o2ib failed: rc = -107
> Nov 28 23:41:47 s5a-s23 kernel: Lustre:
> testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000
> (at 10.13.136.23@o2ib) was lost; in progress operations using this
> service will wait for recovery to complete
> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0:
> testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by
> testfs-OST0004; in progress operations using this service will fail.
> Nov 28 23:43:11 s5a-s23 kernel: Lustre:
> testfs-OST0006-osc-ffff88c053fe3800: Connection restored to
> 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
> Nov 28 23:43:38 s5a-s23 kernel: Lustre:
> testfs-OST002c-osc-ffff88c053fe3800: Connection restored to
> 10.13.136.30@o2ib (at 10.13.136.30@o2ib)
> Nov 28 23:43:45 s5a-s23 kernel: Lustre:
> testfs-OST0000-osc-ffff88c053fe3800: Connection restored to
> 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre:
> testfs-OST0004-osc-ffff88c053fe3800: Connection restored to
> 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
> Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar
> messages
> Nov 28 23:43:55 s5a-s23 kernel: Lustre:
> testfs-OST0007-osc-ffff88c053fe3800: Connection restored to
> 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
> Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar
> messages
>
> Some Details:
> -------------------
> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>
> Client: 2.10.1
>     1 TB RAM
>     Mellanox ConnectX-3 IB/VPI HCAs
>     Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
>     MOFED 3.2.2 IB stack
>     Lustre 2.10.1
>
> Servers: 10 HA OSS pairs (20 OSSs)
>     128 GB RAM
>     6 OSTs (8+2 RAID-6) per OSS
>     Mellanox ConnectX-3 IB/VPI HCAs
>     RedHat EL7 Native IB stack (i.e. not MOFED)
>
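
Also, regarding the “lock callback timer expired” eviction: on the server side
that timer is bounded below by ldlm_enqueue_min, the parameter doubled above.
A rough sketch of how we inspect it on our 2.10 OSSs -- treat it as an example,
since the exact parameter paths can differ between Lustre versions:

    # on an OSS: current floor for the lock callback/enqueue timeout, in seconds
    cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min

    # adaptive timeout settings that also feed into the effective timeout
    lctl get_param at_min at_max

    # raise the floor for a test run (not persistent across a reboot/remount)
    echo 300 > /sys/module/ptlrpc/parameters/ldlm_enqueue_min

Raising it only postpones the eviction if the client really is not answering
the blocking callback, which would be consistent with the 100s -> 200s
behaviour you already saw.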