[lustre-discuss] Files hanging on lustre clients

Kurt Strosahl strosahl at jlab.org
Tue Mar 31 11:36:32 PDT 2020


Hello,

   I'm tracking a very vexing issue.  Somehow users are creating files that cause any attempt to examine or manipulate them to hang.  The files can't be removed or examined; even an ls command against them hangs.

An strace of an ls command run against some of these files produced the following:

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_124.out", "system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)

lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err", {st_mode=S_IFREG|0644, st_size=16979, ...}) = 0

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err", "system.posix_acl_access", NULL, 0) = -1 ENODATA (No data available)

getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err", "system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)

lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_000.out",
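
For reference, the trace above can be reproduced with something along these lines (the output file name here is just an example):

    strace -o /tmp/ls-hang.trace ls -l /volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/

The trace simply stops in the middle of that last lstat() on stdout.030408_000.out; the call never returns, which is where the ls hangs.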

On one client being used to debug this problem (running CentOS 6.5 and Lustre 2.5.42) I see the following:
Lustre: 2753:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1585677903/real 1585677903]  req@ffff880238fdd000 x1657754295856608/t0(0) o101->lustre19-OST0028-osc-ffff88105fecd000@172.17.0.99@o2ib:28/4 lens 328/400 e 1 to 1 dl 1585678504 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection to lustre19-OST0028 (at 172.17.0.99@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection restored to lustre19-OST0028 (at 172.17.0.99@o2ib)
Lustre: 2733:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 465 > total measured time 96
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) @@@ Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_ENQUEUE_LVB': 0 vs. 112 (server)
  req@ffff8802149c5800 x1657754295862208/t0(0) o101->lustre19-OST0028-osc-ffff88105fecd000@172.17.0.99@o2ib:28/4 lens 328/192 e 0 to 0 dl 1585679147 ref 1 fl Interpret:R/2/0 rc 0/0
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) Skipped 1 previous similar message

On newer systems (RHEL 7.7 running Lustre 2.10.8-1) the problem also occurs, and I see the following in dmesg:
[5437818.727792] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)
[5438419.769959] Lustre: 2747:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1585678345/real 1585678345]  req@ffff884834ec0600 x1657002677904272/t0(0) o101->lustre19-OST0028-osc-ffff886229244000@172.17.0.99@o2ib:28/4 lens 344/400 e 24 to 1 dl 1585678946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
[5438419.769973] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection to lustre19-OST0028 (at 172.17.0.99@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[5438419.770435] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)

df and lfs df commands on these systems aren't hanging.
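
For completeness, those checks look roughly like this (assuming /volatile is the client mount point, as the paths above suggest):

    df -h /volatile
    lfs df -h /volatile

Both come back without hanging, so plain statfs traffic to the targets still seems to be getting answered.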

On the OSS (running Lustre 2.12.1-1 and CentOS 7.6) I'm seeing the following:
[Mar31 14:26] LNet: Service thread pid 309349 was inactive for 1121.43s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[  +0.000007] Pid: 309349, comm: ll_ost00_088 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Tue Apr 30 22:18:15 UTC 2019
[  +0.000002] Call Trace:
[  +0.000014]  [<ffffffffc130d560>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[  +0.000074]  [<ffffffffc12cafe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[  +0.000039]  [<ffffffffc12ec54b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
[  +0.000045]  [<ffffffffc16a63fb>] ofd_intent_policy+0x69b/0x920 [ofd]
[  +0.000015]  [<ffffffffc12cbd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[  +0.000038]  [<ffffffffc12f4587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[  +0.000046]  [<ffffffffc137c502>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  +0.000072]  [<ffffffffc1380e5a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[  +0.000065]  [<ffffffffc132680b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[  +0.000056]  [<ffffffffc132a13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[  +0.000054]  [<ffffffff8dac1c71>] kthread+0xd1/0xe0
[  +0.000007]  [<ffffffff8e175c1d>] ret_from_fork_nospec_begin+0x7/0x21
[  +0.000009]  [<ffffffffffffffff>] 0xffffffffffffffff
[  +0.000036] LustreError: dumping log to /tmp/lustre-log.1585679216.309349
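
Everything above points at lustre19-OST0028.  In case it helps with correlation, something like the following should show whether one of the problem files is actually striped on that target (lfs getstripe reads the layout from the MDS, so it typically doesn't hang even when the OST side is stuck):

    lfs getstripe /volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err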

Thank you,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility