[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

Charles A Taylor chasman at ufl.edu
Mon Dec 11 07:27:56 PST 2017


> On Dec 3, 2017, at 11:55 AM, Riccardo Veraldi <riccardo.veraldi at gmail.com> wrote:
> 
> Hello,
> are you using Infiniband ?

Yes - Mellanox FDR (Connect-IB, ConnectX-3)

> if so what are the peer credit settings ?

Defaults.  Note that while we are in the process of upgrading our clients to 2.10.1, most are still 2.5.3.90.
I think I indicated previously that we saw this with both 2.5.3.90 and 2.10.1 using the same client (i.e. we upgraded it to try the 2.10.1 client modules).
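
For what it's worth, we haven't touched the ko2iblnd module parameters on either the clients or the servers, so whatever the stock defaults are is what is in play.  If we do end up tuning, my understanding is it would look roughly like the sketch below (the values are purely illustrative, not a recommendation):

    # current values on the client
    cat /sys/module/ko2iblnd/parameters/peer_credits
    cat /sys/module/ko2iblnd/parameters/peer_credits_hiw
    cat /sys/module/ko2iblnd/parameters/concurrent_sends

    # e.g. /etc/modprobe.d/ko2iblnd.conf, set before the LNet modules load
    options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64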

Although we saw this on our standard 128 GB Haswell clients with this particular application, they may have been RAM-constrained, which is why we went to the “bigmem” host.  However, the “bigmem” host is a bit of an odd bird itself, so we may have just traded one client issue for another.

The application ran successfully (just finished Friday) on a newer “bigmem” server without being evicted.  I’d like to understand what was causing the client to fail to respond to the OSS in a timely fashion, but that may take more effort than is warranted at this point.
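
If it crops up again, the plan (such as it is) is to watch the client's lock and RPC state around the eviction window with nothing fancier than the standard lctl parameters, roughly:

    # LDLM lock counts per OSC namespace on the client
    lctl get_param ldlm.namespaces.*osc*.lock_count
    # write RPC distribution per OST
    lctl get_param osc.*.rpc_stats
    # and dump the kernel debug buffer right after an eviction shows up
    lctl dk /tmp/lustre-debug.$(date +%s)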

> 
>  cat /proc/sys/lnet/nis 

nid                      status alive refs peer  rtr   max    tx   min
0 at lo                         up     0    2    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
0 at lo                         up     0    0    0    0     0     0     0
10.13.136.74 at o2ib            up    -1    1    8    0    64    64    32
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    42
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    26
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    30
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    43
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    34
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    30
10.13.136.74 at o2ib            up    -1    0    8    0    64    64    47
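
(Side note: on the 2.10.1 client the same counters should also be visible in YAML through the newer tooling; the /proc interface above is just what I had handy.)

    lnetctl net show -v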

>  cat /proc/sys/lnet/peers 

[root at s5a-s23 lnet]# cat /proc/sys/lnet/peers 
nid                      refs state  last   max   rtr   min    tx   min queue
10.13.136.159 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.15 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.28 at o2ib           1    NA    -1     8     8     8     8   -16 0
10.13.136.7 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.130 at o2ib          1    NA    -1     8     8     8     8     1 0
10.13.136.164 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.20 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.156 at o2ib          1    NA    -1     8     8     8     8     1 0
10.13.136.12 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.169 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.25 at o2ib           1    NA    -1     8     8     8     8   -41 0
10.13.136.4 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.161 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.17 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.51 at o2ib           1    NA    -1     8     8     8     8     6 0
10.13.136.30 at o2ib           1    NA    -1     8     8     8     8   -27 0
10.13.136.153 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.9 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.166 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.22 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.1 at o2ib            1    NA    -1     8     8     8     8     0 0
10.13.136.158 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.14 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.171 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.27 at o2ib           1    NA    -1     8     8     8     8   -13 0
10.13.136.6 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.129 at o2ib          1    NA    -1     8     8     8     8     7 0
10.13.136.163 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.19 at o2ib           1    NA    -1     8     8     8     8     1 0
10.13.136.32 at o2ib           1    NA    -1     8     8     8     8   -46 0
10.13.136.155 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.11 at o2ib           1    NA    -1     8     8     8     8    -4 0
10.13.136.168 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.24 at o2ib           1    NA    -1     8     8     8     8   -65 0
10.13.136.3 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.160 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.16 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.29 at o2ib           1    NA    -1     8     8     8     8   -46 0
10.13.136.8 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.165 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.21 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.157 at o2ib          1    NA    -1     8     8     8     8     1 0
10.13.136.13 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.170 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.26 at o2ib           1    NA    -1     8     8     8     8   -46 0
10.13.136.5 at o2ib            1    NA    -1     8     8     8     8     2 0
10.13.136.162 at o2ib          1    NA    -1     8     8     8     8    -1 0
10.13.136.18 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.31 at o2ib           1    NA    -1     8     8     8     8   -40 0
10.13.136.154 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.10 at o2ib           1    NA    -1     8     8     8     8     2 0
10.13.136.167 at o2ib          1    NA    -1     8     8     8     8     2 0
10.13.136.23 at o2ib           1    NA    -1     8     8     8     8   -53 0
10.13.136.2 at o2ib            1    NA    -1     8     8     8     8     8 0
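
As I read it, the negative numbers in the "min" column after "tx" just mean the send credits for those peers dipped below zero at some point (i.e. messages had to queue waiting for credits), though I don't know whether that is related to the evictions.  On 2.10 the per-peer view should be available with something like:

    lnetctl peer show -v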


> 
> 
> On 12/3/17 8:38 AM, E.S. Rosenberg wrote:
>> Did you find the problem? Were there any useful suggestions off-list?
>> 
>> On Wed, Nov 29, 2017 at 1:34 PM, Charles A Taylor <chasman at ufl.edu> wrote:
>> 
>> We have a genomics pipeline app (supernova) that fails consistently because the client is evicted on the OSSs with a “lock callback timer expired”.  I doubled “ldlm_enqueue_min” across the cluster, but then the timer simply expired after 200s rather than 100s, so I don’t think that is the answer.  The syslog/dmesg on the client shows no signs of distress, and it is a “bigmem” machine with 1 TB of RAM.
>> 
>> The eviction appears to come while the application is processing a large number (~300) of data “chunks” (i.e. files) which occur in pairs.
>> 
>> -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
>> -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
>> 
>> I assume the 24-byte file is metadata (an index or some such) and the 34M file is the actual data but I’m just guessing since I’m completely unfamiliar with the application.
>> 
>> The write error is,
>> 
>>     #define ENOTCONN        107     /* Transport endpoint is not connected */
>> 
>> which occurs after the OSS eviction.  This was reproducible under 2.5.3.90 as well.  We hoped that upgrading to 2.10.1 would resolve the issue but it has not.
>> 
>> This is the first application (in 10 years) we have encountered that consistently and reliably fails when run over Lustre.  I’m not sure at this point whether this is a bug or tuning issue.
>> If others have encountered and overcome something like this, we’d be grateful to hear from you.
>> 
>> Regards,
>> 
>> Charles Taylor
>> UF Research Computing
>> 
>> OSS:
>> --------------
>> Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 201s: evicting client at 10.13.136.74 at o2ib  ns: filter-testfs-OST002e_UUID lock: ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74 at o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506 lvb_type: 0
>> 
>> Client:
>> ———————
>> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node 10.13.136.30 at o2ib failed: rc = -107
>> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e (at 10.13.136.30 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
>> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by testfs-OST002e; in progress operations using this service will fail.
>> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node 10.13.136.30 at o2ib failed: rc = -107
>> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c (at 10.13.136.30 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
>> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by testfs-OST002c; in progress operations using this service will fail.
>> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node 10.13.136.23 at o2ib failed: rc = -107
>> Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000 (at 10.13.136.23 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
>> Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by testfs-OST0004; in progress operations using this service will fail.
>> Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800: Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
>> Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection restored to 10.13.136.30 at o2ib (at 10.13.136.30 at o2ib)
>> Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
>> Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800: Connection restored to 10.13.136.23 at o2ib (at 10.13.136.23 at o2ib)
>> Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
>> Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800: Connection restored to 10.13.136.24 at o2ib (at 10.13.136.24 at o2ib)
>> Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages
>> 
>> Some Details:
>> -------------------
>> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>> Client: 2.10.1
>>      1 TB RAM
>>      Mellanox ConnectX-3 IB/VPI HCAs
>>      Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
>>      MOFED 3.2.2 IB stack
>>      Lustre 2.10.1
>> Servers: 10 HA OSS pairs (20 OSSs)
>>    128 GB RAM
>>    6 OSTs (8+2 RAID-6) per OSS
>>    Mellanox ConnectX-3 IB/VPI HCAs
>>    RedHat EL7 Native IB Stack (i.e. not MOFED)
>> 
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
> 
