[lustre-discuss] client evictions

Strikwerda, Ger g.j.c.strikwerda at rug.nl
Wed Dec 19 06:02:40 PST 2018


Here at the University of Groningen we run a Lustre setup in which
client nodes are being evicted by the metadata server:

Kernel: CentOS 7.5 3.10.0-862.2.3-lustre
Lustre: 2.10.4
Network: IB/10 Gb Ethernet

Logs on the client:

Dec 19 06:45:28 dh-node03 kernel: [1952901.506173] LustreError: 11-0:
dh3-MDT0000-mdc-ffff9f337a935000: operation ldlm_enqueue to node
172.23.53.205@o2ib3 failed: rc = -107
Dec 19 06:45:28 dh-node03 kernel: [1952901.508610] Lustre:
dh3-MDT0000-mdc-ffff9f337a935000: Connection to dh3-MDT0000 (at
172.23.53.205@o2ib3) was lost; in progress operations using this service
will wait for recovery to complete
Dec 19 06:45:28 dh-node03 kernel: [1952901.559429] LustreError: 167-0:
dh3-MDT0000-mdc-ffff9f337a935000: This client was evicted by dh3-MDT0000;
in progress operations using this service will fail.
Dec 19 06:45:28 dh-node03 kernel: [1952901.559678] LustreError:
29373:0:(file.c:172:ll_close_inode_openhandle())
dh3-clilmv-ffff9f337a935000: inode [0x200009e9e:0xfabd:0x0] mdc close
failed: rc = -5
Dec 19 06:45:28 dh-node03 kernel: [1952901.559681] LustreError:
29373:0:(file.c:172:ll_close_inode_openhandle()) Skipped 1 previous similar
message
Dec 19 06:45:28 dh-node03 kernel: [1952901.594335] LustreError:
27096:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
Dec 19 06:45:28 dh-node03 kernel: [1952901.627102] LustreError:
29477:0:(file.c:3644:ll_inode_revalidate_fini()) dh3: revalidate FID
[0x200009e9e:0xef54:0x0] error: rc = -108
Dec 19 06:45:29 dh-node03 kernel: [1952902.316568] LustreError:
29373:0:(file.c:172:ll_close_inode_openhandle())
dh3-clilmv-ffff9f337a935000: inode [0x200009e9e:0xfb45:0x0] mdc close
failed: rc = -108
Dec 19 06:45:29 dh-node03 kernel: [1952902.318931] LustreError:
29373:0:(file.c:172:ll_close_inode_openhandle()) Skipped 7 previous similar
messages
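
For anyone reading along: the rc values in the client log look like negated Linux errno numbers. Below is a minimal sketch for decoding them, assuming standard Linux errno numbering. The lookup table and the helper name `decode_rc` are illustrative only and are not part of Lustre.

```python
# Linux errno numbers that show up in the logs.
# Assumption: standard Linux numbering (as in errno.h on CentOS 7).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    19: ("ENODEV", "No such device"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    110: ("ETIMEDOUT", "Connection timed out"),  # seen in the server-side log
}


def decode_rc(rc):
    """Map a negative kernel return code to (errno name, description)."""
    return LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in table"))


# The four rc values reported by the client above.
for rc in (-107, -5, -19, -108):
    name, text = decode_rc(rc)
    print(f"rc = {rc}: {name} ({text})")
```

Read this way, -107 and -108 would both point at a connection that is gone or shutting down, which fits the eviction message.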

Logs on the metadata server:

Dec 19 06:45:28 dh3-mds01 kernel: LustreError:
3883:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
172.23.53.3@o2ib3) failed to reply to blocking AST
(req@ffff9887dc085100 x1606923837274448 status 0 rc -110), evict it
ns: mdt-dh3-MDT0000_UUID lock: ffff988c19602800/0x7477c37a8da4a564
lrc: 4/0,0 mode: PR/PR res: [0x200009e9e:0xfb4c:0x0].0x0 bits 0x20
rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.23.53.3@o2ib3
remote: 0xea19f3efd5f14578 expref: 5753603 pid: 44838
timeout: 17016180683 lvb_type: 0

Dec 19 06:45:28 dh3-mds01 kernel: LustreError: 138-a: dh3-MDT0000: A client
on nid 172.23.53.3@o2ib3 was evicted due to a lock blocking callback
time out: rc -110
Dec 19 06:45:28 dh3-mds01 kernel: Lustre: dh3-MDT0000: Connection restored
to ad5824f9-f876-01c5-a14b-5a22ddabed41 (at 172.23.53.3@o2ib3)

Dec 19 06:48:49 dh3-mds01 kernel: LNet: Service thread pid 3883 was
inactive for 200.66s. The thread might be hung, or it might only be
slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 19 06:48:49 dh3-mds01 kernel: LNet:
3187:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel
doesn't export show_task
Dec 19 06:48:49 dh3-mds01 kernel: LustreError: dumping log to
/tmp/lustre-log.1545198529.3883

Dec 19 06:50:09 dh3-mds01 kernel: LNet: Service thread pid 172356 was
inactive for 200.15s. The thread might be hung, or it might only be
slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 19 06:50:09 dh3-mds01 kernel: LNet:
3187:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel
doesn't export show_task
Dec 19 06:50:09 dh3-mds01 kernel: LustreError: dumping log to
/tmp/lustre-log.1545198609.172356

Dec 19 06:50:28 dh3-mds01 kernel: LustreError:
3883:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed
out (enqueued at 1545198328, 300s ago); not entering recovery in server
code, just going back to sleep ns: mdt-dh3-MDT0000_UUID
lock: ffff988c3f913c00/0x7477c37a8da4e2c0 lrc: 3/0,1 mode: --/EX
res: [0x200009e9e:0xfb4c:0x0].0x0 bits 0x21 rrc: 3 type: IBT
flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 3883
timeout: 0 lvb_type: 0
Dec 19 06:50:28 dh3-mds01 kernel: LustreError: dumping log to
/tmp/lustre-log.1545198628.3883

Dec 19 06:51:49 dh3-mds01 kernel: LustreError:
172356:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed
out (enqueued at 1545198409, 300s ago); not entering recovery in server
code, just going back to sleep ns: mdt-dh3-MDT0000_UUID
lock: ffff988c29834400/0x7477c37a8df47a1c lrc: 3/0,1 mode: --/EX
res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 3 type: IBT
flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 172356
timeout: 0 lvb_type: 0
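
To line up the "enqueued at" epoch values with the syslog timestamps, here is a small sketch. It assumes the syslog times are local time at UTC+1 (CET); that offset and the helper name `enqueue_time` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the servers log in local time at UTC+1 (CET in December).
CET = timezone(timedelta(hours=1))

# Wait duration reported in the log messages ("300s ago").
WAIT = timedelta(seconds=300)


def enqueue_time(epoch_seconds):
    """Convert an 'enqueued at' epoch value to an aware local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=CET)


# (service thread pid, 'enqueued at' value) pairs taken from the MDS log.
for pid, epoch in ((3883, 1545198328), (172356, 1545198409)):
    start = enqueue_time(epoch)
    expired = start + WAIT
    print(f"pid {pid}: enqueued {start:%b %d %H:%M:%S}, "
          f"timed out {expired:%H:%M:%S}")
```

Under that assumption, thread 3883 would have enqueued its lock at 06:45:28, the same second the client was evicted, and thread 172356 at 06:46:49.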

Possibly we need to do some LNet tuning; currently we don't have any
tuning set on the clients, metadata server, or OSSes. Any pointers, hints,
tricks, or tips to point us in the right direction would be much
appreciated!

-- 

Kind regards,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Team HPC Beheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
 some men he gave brains, others he gave hair"