[lustre-discuss] Lustre ldlm_lockd errors

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Wed Apr 25 14:49:37 PDT 2018


Hello,
I am having quite a serious problem with the lock manager.
First of all, we are using Lustre 2.10.3 on both the server and client side
on RHEL7.
The only difference between servers and clients is that the Lustre OSSes have
kernel 4.4.126 while the clients have the stock RHEL7 kernel.
We have NVMe disks on the OSSes, and kernel 4.4 manages IRQ balancing for
NVMe disks much better.
It is possible to reproduce the problem.
I get this error during simultaneous read and write. If I run the
writer and the reader sequentially, the problem does not occur and
everything performs really well.
Unfortunately we need to write a file and have several threads reading
from it at the same time.
So one big file is written, and after a while multiple reader threads
access the file to read the (experimental) data. This is the model of
our DAQ.
The specific failures occur in the reader threads when they ask
for the file size (a call to os.stat() in Python).
This is done both to delay the start of the readers until the file exists
and to keep the readers from deadlocking the writer by repeatedly asking
for the data at the end of the file.
I do not know if there is a way to fix this. Apparently, writing one
file and having a bunch of threads reading from the same file makes the
lock manager unhappy in some way.
Any hints would be much appreciated. Thank you.
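
To make the access pattern concrete, here is a simplified Python sketch
of what our application does (this is not the real DAQ code; the file
path, chunk size, durations and thread count are just placeholders): one
writer appends to a single big file on the Lustre mount while several
reader threads poll os.stat() for the file size and read behind it.

# Simplified sketch only -- not our real DAQ code. The path, chunk size,
# durations and thread count below are placeholders.
import os
import threading
import time

DATA_FILE = "/lustre/drpffb/testdir/bigfile.dat"   # placeholder path on the Lustre mount
CHUNK = 1024 * 1024                                # 1 MiB per write/read
N_READERS = 8
WRITE_SECONDS = 30

def writer():
    # Append data continuously, as the DAQ writer does.
    buf = b"\0" * CHUNK
    end = time.time() + WRITE_SECONDS
    with open(DATA_FILE, "wb") as f:
        while time.time() < end:
            f.write(buf)
            f.flush()

def reader(chunk):
    # Wait for the file to appear, then poll its size with os.stat()
    # and read the newly written region behind the writer. The
    # os.stat() call is where the failures show up for us.
    pos = 0
    deadline = time.time() + WRITE_SECONDS + 10
    while time.time() < deadline:
        try:
            size = os.stat(DATA_FILE).st_size
        except FileNotFoundError:
            time.sleep(0.1)
            continue
        if size - pos >= chunk:
            with open(DATA_FILE, "rb") as f:
                f.seek(pos)
                f.read(chunk)
            pos += chunk
        else:
            time.sleep(0.05)

if __name__ == "__main__":
    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=reader, args=(CHUNK,))
                for _ in range(N_READERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Running the writer alone or the readers alone is fine; only the
concurrent combination like this triggers the evictions below.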

Errors OSS side:

Apr 25 10:31:19 drp-tst-ffb01 kernel: LustreError:
0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer
expired after 101s: evicting client at 172.21.52.131@o2ib  ns:
filter-drpffb-OST0001_UUID lock: ffff88202010b600/0x5be7c3e66a45b63f
lrc: 3/0,0 mode: PR/PR res: [0x4ad:0x0:0x0].0x0 rrc: 4397 type: EXT
[0->18446744073709551615] (req 0->18446744073709551615) flags:
0x60000400010020 nid: 172.21.52.131@o2ib remote: 0xc0c93433d781fff9
expref: 5 pid: 10804 timeout: 4774735450 lvb_type: 1
Apr 25 10:31:20 drp-tst-ffb01 kernel: LustreError:
9524:0:(ldlm_lockd.c:2365:ldlm_cancel_handler()) ldlm_cancel from
172.21.52.127@o2ib arrived at 1524677480 with bad export cookie
6622477171464070609
Apr 25 10:31:20 drp-tst-ffb01 kernel: Lustre: drpffb-OST0001: Connection
restored to 23bffb9d-10bd-0603-76f6-e2173f99e3c6 (at 172.21.52.127@o2ib)
Apr 25 10:31:20 drp-tst-ffb01 kernel: Lustre: Skipped 65 previous
similar messages


Errors client side:

Apr 25 10:31:19 drp-tst-acc06 kernel: Lustre:
drpffb-OST0002-osc-ffff880167fda800: Connection to drpffb-OST0002 (at
172.21.52.84@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
Apr 25 10:31:19 drp-tst-acc06 kernel: Lustre: Skipped 1 previous similar
message
Apr 25 10:31:19 drp-tst-acc06 kernel: LustreError: 167-0:
drpffb-OST0002-osc-ffff880167fda800: This client was evicted by
drpffb-OST0002; in progress operations using this service will fail.
Apr 25 10:31:22 drp-tst-acc06 kernel: LustreError: 11-0:
drpffb-OST0001-osc-ffff880167fda800: operation ost_statfs to node
172.21.52.83@o2ib failed: rc = -107
Apr 25 10:31:22 drp-tst-acc06 kernel: Lustre:
drpffb-OST0001-osc-ffff880167fda800: Connection to drpffb-OST0001 (at
172.21.52.83@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
Apr 25 10:31:22 drp-tst-acc06 kernel: LustreError: 167-0:
drpffb-OST0001-osc-ffff880167fda800: This client was evicted by
drpffb-OST0001; in progress operations using this service will fail.
Apr 25 10:31:22 drp-tst-acc06 kernel: LustreError:
59702:0:(ldlm_resource.c:1100:ldlm_resource_complain())
drpffb-OST0001-osc-ffff880167fda800: namespace resource
[0x4ad:0x0:0x0].0x0 (ffff881004af6e40) refcount nonzero (1) after lock
cleanup; forcing cleanup.
Apr 25 10:31:22 drp-tst-acc06 kernel: LustreError:
59702:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
[0x4ad:0x0:0x0].0x0 (ffff881004af6e40) refcount = 2
Apr 25 10:31:22 drp-tst-acc06 kernel: LustreError:
59702:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
[0x4ad:0x0:0x0].0x0 (ffff881004af6e40) refcount = 2


Some other info that may be useful:

# lctl get_param  llite.*.max_cached_mb
llite.drpffb-ffff880167fda800.max_cached_mb=
users: 5
max_cached_mb: 64189
used_mb: 9592
unused_mb: 54597
reclaim_count: 0
llite.drplu-ffff881fe1f99000.max_cached_mb=
users: 8
max_cached_mb: 64189
used_mb: 0
unused_mb: 64189
reclaim_count: 0

# lctl get_param ldlm.namespaces.*.lru_size
ldlm.namespaces.MGC172.21.42.159@tcp.lru_size=1600
ldlm.namespaces.MGC172.21.42.213@tcp.lru_size=1600
ldlm.namespaces.drpffb-MDT0000-mdc-ffff880167fda800.lru_size=3
ldlm.namespaces.drpffb-OST0001-osc-ffff880167fda800.lru_size=0
ldlm.namespaces.drpffb-OST0002-osc-ffff880167fda800.lru_size=2
ldlm.namespaces.drpffb-OST0003-osc-ffff880167fda800.lru_size=0
ldlm.namespaces.drplu-MDT0000-mdc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0001-osc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0002-osc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0003-osc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0004-osc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0005-osc-ffff881fe1f99000.lru_size=0
ldlm.namespaces.drplu-OST0006-osc-ffff881fe1f99000.lru_size=0

Rick