[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

Simon Kelley simon at thekelleys.org.uk
Tue Feb 10 02:17:34 PST 2009


We're seeing problems which seem to be lock-related and result in data
loss. This is a fairly low-probability event: it's happening on a
heavily loaded ~1000-core compute farm, and the frequency is of the
order of tens of failures an hour.

The clients are running kernel 2.6.22 and Lustre 1.6.6 from Sun. The
servers are HP SFS running Lustre 1.4.11. The mixed versions are
deliberate: we're in the process of adding servers running 1.6.6, but
none are currently in use; all servers are 1.4.11. The 1.4.11 clients
don't see problems, and some old 1.4.11 clients are still using these
filesystems at the same time as the new 1.6.6 ones.

On the server (OST), an event typically looks like this:

2009/02/09 14:05:10 kern    i   sf2-sfs33: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
4cdd29e6-8b9e-7e81-5d72-2a055f0e2a38@NET_0x20000ac1f6eee_UUID nid
172.31.110.238@tcp ns: filter-sf2-sfs-ost495_UUID lock:
0000010048ae0100/0x5fd2cf129ee3e5bb lrc: 1/0 , 0 mode: PW/PW res:
1542868/0 rrc: 4 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab0d8d expref:
386 pid: 16626
2009/02/09 14:05:30 kern    i   sf2-sfs33: LustreError:
16577:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) req@0000010073bbce00 x2138818/t0 o400-><?>@<?>:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0

and on the client at the same time:

Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: 11-0: an error occurred
while communicating with 172.31.96.96@tcp. The obd_ping operation failed
with -107
Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: Skipped 12 previous
similar messages
Feb  9 14:05:30 sf-2-3-10 kernel: Lustre:
OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost495_MNT_client_tcp-ffff81021f897800:
Connection to service sf2-sfs-ost495 via nid 172.31.96.96@tcp was lost;
in progress operations using this service will wait for recovery to
complete.
Feb  9 14:05:30 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar
messages
Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: 167-0: This client was
evicted by sf2-sfs-ost495; in progress operations using this service
will fail.



It gets worse: the same client, a few minutes later, saw this on another
OST on the same filesystem:

2009/02/09 14:08:30 kern    i   sf2-sfs34: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
4cdd29e6-8b9e-7e81-5d72-2a055f0e2a38@NET_0x20000ac1f6eee_UUID nid
172.31.110.238@tcp ns: filter-sf2-sfs-ost496_UUID lock:
000001006b8bfa40/0x900da6275f064529 lrc: 1/0 , 0 mode: PW/PW res:
1542328/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab1478 expref:
172 pid: 16089
2009/02/09 14:08:30 kern    i   sf2-sfs33: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
f3275d1e-1360-e139-e71c-44b283d7ec3c@NET_0x20000ac1f665d_UUID nid
172.31.102.93@tcp ns: filter-sf2-sfs-ost495_UUID lock:
000001005a810040/0x5fd2cf129ee3e9f1 lrc: 1/0 , 0 mode: PW/PW res:
1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0xa899b6e20a06a83e expref:
316 pid: 16454
2009/02/09 14:08:30 kern    i   sf2-sfs33: LustreError:
16335:0:(ldlm_lockd.c:606:ldlm_server_completion_ast()) ### enqueue wait
took 399491082us from 1234188111 ns: filter-sf2-sfs-ost495_UUID lock:
0000010064a638c0/0x5fd2cf129ee3ea68 lrc: 2/0 , 0 mode: PW/PW res:
1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 20 remote: 0xc8423e042b195e36 expref:
501 pid: 16493
2009/02/09 14:08:32 storage w   sf2-sfs1 : Warning: event log
(/var/hpls/evlog/eventlog) size is 77760824
2009/02/09 14:08:50 kern    i   sf2-sfs34: LustreError:
16253:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) req@00000100b3ea4a00 x2150055/t0 o400-><?>@<?>:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0
2009/02/09 14:09:02 kern    i   sf2-sfs33: LustreError:
16566:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) req@000001006b7b3200 x5180156/t0 o400-><?>@<?>:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0

whilst the client suffered an assertion failure and LBUG:

Feb  9 14:08:50 sf-2-3-10 kernel: LustreError: 11-0: an error occurred
while communicating with 172.31.96.97@tcp. The obd_ping operation failed
with -107
Feb  9 14:08:50 sf-2-3-10 kernel: Lustre:
OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800:
Connection to service sf2-sfs-ost496 via nid 172.31.96.97@tcp was lost;
in progress operations using this service will wait for recovery to
complete.
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError: 167-0: This client was
evicted by sf2-sfs-ost496; in progress operations using this service
will fail.
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError:
29564:0:(import.c:241:ptlrpc_invalidate_import()) sf2-sfs-ost495_UUID:
rc = -110 waiting for callback (1 != 0)
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError:
29564:0:(import.c:252:ptlrpc_invalidate_import())
ASSERTION(atomic_read(&imp->imp_inflight) == 0) failed
Feb  9 14:08:50 sf-2-3-10 kernel: Lustre:
OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800:
Connection restored to service sf2-sfs-ost496 using nid 172.31.96.97@tcp.
Feb  9 14:08:50 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar
messages
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError:
29564:0:(tracefile.c:450:libcfs_assertion_failed()) LBUG
Feb  9 14:08:50 sf-2-3-10 kernel: Lustre:
29564:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for
process 29564
Feb  9 14:08:50 sf-2-3-10 kernel: ll_imp_inval  R  running task       0
29564      2 (L-TLB)
Feb  9 14:08:50 sf-2-3-10 kernel: ffffffff804b1d28 ffffffff804b1d28
ffffffff804b1d28 0000000000000000
Feb  9 14:08:50 sf-2-3-10 kernel: ffffffff8856cf5f ffff8101a231fcc0
0000003000000030 ffff8101b8079e90
Feb  9 14:08:50 sf-2-3-10 kernel: ffff8101b8079d70 ffffffff8025d587
0000000000000286 ffffffff8025d5a9
Feb  9 14:08:50 sf-2-3-10 kernel: Call Trace:
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b2f0>] show_trace+0x34/0x47
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b3fc>] _show_stack+0xe2/0xf1
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff882f7b7c>]
:libcfs:lbug_with_loc+0x6c/0xd0
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff882ffbc4>]
:libcfs:libcfs_assertion_failed+0x54/0x60
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff88451785>]
:ptlrpc:ptlrpc_invalidate_import+0x4d5/0x5c0
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff8022c500>]
default_wake_function+0x0/0xe
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff80292e8c>] dput+0x1c/0x10b
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ce9>]
:ptlrpc:ptlrpc_invalidate_import_thread+0x49/0x2e0
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ca0>]
:ptlrpc:ptlrpc_invalidate_import_thread+0x0/0x2e0
Feb  9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12
Feb  9 14:08:50 sf-2-3-10 kernel:

Userspace file operations terminate with an I/O error when these events
happen. The failed operations seem to be writes to files which have been
open for a long time (spool files taking job output).
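
For illustration, the pattern that fails looks roughly like the sketch
below; the path, interval and flags are invented, but the shape (one
descriptor held open for hours, with occasional appends) matches our
spool files:

/*
 * Minimal sketch of the failing access pattern: a process keeps one
 * spool-style file open for a long time and appends to it now and then.
 * After an eviction the next write() on the handle comes back with
 * -1/EIO.  The path and interval are made up for illustration.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical spool file on the Lustre mount */
    int fd = open("/lustre/spool/job.12345.out",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        const char line[] = "job output line\n";
        if (write(fd, line, sizeof(line) - 1) < 0) {
            /* this is where an evicted client surfaces the problem */
            fprintf(stderr, "%ld: write failed: %s (errno %d)\n",
                    (long)time(NULL), strerror(errno), errno);
            close(fd);
            return 1;
        }
        sleep(60);  /* long-open and mostly idle between appends */
    }
}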

We are also seeing some userspace file operations fail with the error
"No locks available" (ENOLCK). These don't generate any logging on the
client, so I don't have exact timings. It's possible that they are
associated with further "### lock callback timer expired" server logs.


The OSTs have the Lustre timeout set to 200 seconds. System load is not
excessive, and is no greater than has been supported for more than a
year with 1.4 clients.

I'd be very happy to have any clues on fixing this....


Cheers,

Simon.




