<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hello,<br>
Are you using InfiniBand?<br>
If so, what are the peer credit settings?<br>
<br>
cat /proc/sys/lnet/nis<br>
cat /proc/sys/lnet/peers<br>
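On 2.10 the same counters are also exposed through lnetctl; a sketch, assuming the o2iblnd driver (exact fields vary by version and LND):<br>
<br>
lnetctl net show -v&nbsp;&nbsp;&nbsp;# per-NI tunables, including peer_credits and credits<br>
lnetctl peer show -v&nbsp;&nbsp;# per-peer credit state, where your version supports it<br>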
<br>
<br>
On 12/3/17 8:38 AM, E.S. Rosenberg wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+K1OzT7NQ6JUNOgSyKfNQtqioy0bUs7YMwTNscrV8rcORnVog@mail.gmail.com">
<div dir="ltr">Did you find the problem? Were there any useful
suggestions off-list?<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Nov 29, 2017 at 1:34 PM,
Charles A Taylor <span dir="ltr"><<a
href="mailto:chasman@ufl.edu" target="_blank"
moz-do-not-send="true">chasman@ufl.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
We have a genomics pipeline app (supernova) that fails
consistently due to the client being evicted on the OSSs
with a “lock callback timer expired”. I doubled
“ldlm_enqueue_min” across the cluster, but then the timer
simply expired after 200s rather than 100s, so I don’t think
that is the answer. The syslog/dmesg on the client shows
no signs of distress and it is a “bigmem” machine with 1TB
of RAM.<br>
<br>
The eviction appears to come while the application is
processing a large number (~300) of data “chunks” (i.e.
files) which occur in pairs.<br>
<br>
-rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs<br>
-rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm<br>
<br>
I assume the 24-byte file is metadata (an index or some
such) and the 34M file is the actual data but I’m just
guessing since I’m completely unfamiliar with the
application.<br>
<br>
The write error is,<br>
<br>
#define ENOTCONN 107 /* Transport endpoint is
not connected */<br>
<br>
which occurs after the OSS eviction. This was reproducible
under 2.5.3.90 as well. We hoped that upgrading to 2.10.1
would resolve the issue but it has not.<br>
<br>
This is the first application (in 10 years) we have
encountered that consistently and reliably fails when run
over Lustre. I’m not sure at this point whether this is a
bug or a tuning issue.<br>
If others have encountered and overcome something like this,
we’d be grateful to hear from you.<br>
<br>
Regards,<br>
<br>
Charles Taylor<br>
UF Research Computing<br>
<br>
OSS:<br>
--------------<br>
Nov 28 23:41:41 ufrcoss28 kernel: LustreError:
0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ###
lock callback timer expired after 201s: evicting client at
10.13.136.74@o2ib ns: filter-testfs-OST002e_UUID lock:
ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode:
PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 4096->1802239) flags:
0x60000400010020 nid: 10.13.136.74@o2ib remote:
0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506
lvb_type: 0<br>
<br>
Client:<br>
———————<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
testfs-OST002e-osc-ffff88c053fe3800: operation
ost_write to node 10.13.136.30@o2ib failed: rc = -107<br>
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800:
Connection to testfs-OST002e (at 10.13.136.30@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
testfs-OST002e-osc-ffff88c053fe3800: This client was
evicted by testfs-OST002e; in progress operations using this
service will fail.<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0:
testfs-OST002c-osc-ffff88c053fe3800: operation
ost_punch to node 10.13.136.30@o2ib failed: rc = -107<br>
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
Connection to testfs-OST002c (at 10.13.136.30@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0:
testfs-OST002c-osc-ffff88c053fe3800: This client was
evicted by testfs-OST002c; in progress operations using this
service will fail.<br>
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0:
testfs-OST0000-osc-ffff88c053fe3800: operation
ost_statfs to node 10.13.136.23@o2ib failed: rc = -107<br>
Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
Connection to testfs-OST0000 (at 10.13.136.23@o2ib) was
lost; in progress operations using this service will wait
for recovery to complete<br>
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0:
testfs-OST0004-osc-ffff88c053fe3800: This client was
evicted by testfs-OST0004; in progress operations using this
service will fail.<br>
Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800:
Connection restored to 10.13.136.24@o2ib (at
10.13.136.24@o2ib)<br>
Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800:
Connection restored to 10.13.136.30@o2ib (at
10.13.136.30@o2ib)<br>
Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800:
Connection restored to 10.13.136.23@o2ib (at
10.13.136.23@o2ib)<br>
Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800:
Connection restored to 10.13.136.23@o2ib (at
10.13.136.23@o2ib)<br>
Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous
similar messages<br>
Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800:
Connection restored to 10.13.136.24@o2ib (at
10.13.136.24@o2ib)<br>
Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous
similar messages<br>
<br>
Some Details:<br>
-------------------<br>
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc
3.10.0-693.2.2.el7_lustre.x86_64)<br>
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)<br>
Client: 2.10.1<br>
1 TB RAM<br>
Mellanox ConnectX-3 IB/VPI HCAs<br>
Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64<br>
MOFED 3.2.2 IB stack<br>
Lustre 2.10.1<br>
Servers: 10 HA OSS pairs (20 OSSs)<br>
128 GB RAM<br>
6 OSTs (8+2 RAID-6) per OSS<br>
Mellanox ConnectX-3 IB/VPI HCAs<br>
RedHat EL7 Native IB Stack (i.e. not MOFED)<br>
<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
<a href="mailto:lustre-discuss@lists.lustre.org"
moz-do-not-send="true">lustre-discuss@lists.lustre.org</a><br>
<a
href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<p><br>
</p>
</body>
</html>