[Lustre-discuss] lock callback timer expired: evicting client
Per Lundqvist
perl at nsc.liu.se
Fri Dec 14 08:41:27 PST 2007
Recently we upgraded all of our Lustre clients from 1.4.11.1 to 1.6.3
while keeping 1.4.11.1 on all of the MDS/OSS nodes (which will be
upgraded later on). This has resulted in a previously unobserved error
when running a specific application (ccsm3) on this Lustre filesystem:
opening the same file from a handful of compute nodes results in some or
all of the corresponding Lustre clients getting evicted and the open
system call failing.
Once triggered, any process on any node that tries to stat the directory
where this file is opened will also get evicted and will freeze for
100 s. (Output from /var/log/messages is included at the end of this
mail; apologies for the long message.)
I am interested in knowing whether this problem is triggered only when
using 1.6.3 clients against 1.4.11.1 servers, or whether it might also
occur with more closely matched client/server versions. Is there any way
to prevent this from happening without upgrading the servers or
downgrading the clients? (I will later try to reproduce this using
1.4.11.1 clients.)
Our setup:
1 MDS + 3 OSS: Centos4 with 2.6.9-55.0.9.EL_lustre.1.4.11.1smp
clients: Centos4 with 2.6.9-55.0.9.EL_lustre.1.6.3smp
(Lustre over Gigabit ethernet, no failover)
We have narrowed this down to two similar examples, both of which
trigger the error:
one in Fortran:

      implicit none
      integer i

      do i = 1, 1000
         open(unit=16, file='tmpfile')
         close(16)
      enddo

      stop
      end
and one in C:
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int i, fd;
    char cwd[4096];
    char tmpfile[4196];

    for (i = 0; i < 1000; i++) {
        if (getcwd(cwd, 4096) == NULL)
            exit(-1);
        sprintf(tmpfile, "%s/tmpfile", cwd);
        fd = open(tmpfile, O_RDWR|O_CREAT, 0666);
        close(fd);
    }
    return 0;
}
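(As a diagnostic aid, and not part of the original test, the same loop
can be written with explicit error reporting so the failing iteration
and errno are visible when a client is evicted; the perror/strerror
output below is my addition:)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int i, fd;
    char cwd[4096];
    char path[4096 + 16];

    for (i = 0; i < 1000; i++) {
        if (getcwd(cwd, sizeof(cwd)) == NULL) {
            perror("getcwd");
            exit(EXIT_FAILURE);
        }
        snprintf(path, sizeof(path), "%s/tmpfile", cwd);
        fd = open(path, O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
            /* On an evicted client this is where the open fails;
             * report which iteration and which errno we got. */
            fprintf(stderr, "iteration %d: open failed: %s (errno %d)\n",
                    i, strerror(errno), errno);
            exit(EXIT_FAILURE);
        }
        close(fd);
    }
    printf("all opens succeeded\n");
    return 0;
}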
We can trigger the error by executing one of the examples on at least 3
nodes in parallel, i.e.:
1. copy the binary to its own directory on the lustre filesystem.
2. execute the binary in parallel on at least 3 nodes.
Example:
gcc evict.c -o evict
mkdir /lustrefs/tmpdir
cp evict /lustrefs/tmpdir
pdsh -f3 -w n[1-3] 'cd /lustrefs/tmpdir; ./evict'
/var/log/messages on evicted client:
Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226@tcp. The ldlm_enqueue operation failed with -107
Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226@tcp was lost; in progress operations using this service will wait for recovery to complete.
Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail.
Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001007dc87400 x854/t0 o101->misu2_mds_UUID@mds2_UUID:12 lens 392/728 ref 1 fl Rpc:/0/0 rc 0/0
Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226@tcp.
Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226@tcp. The ldlm_enqueue operation failed with -107
Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226@tcp was lost; in progress operations using this service will wait for recovery to complete.
Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail.
Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message
Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226@tcp.
....<snip>..
/var/log/messages on MDS:
Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542@NET_0x20000c0a80b4b_UUID nid 192.168.11.75@tcp ns: mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote: 0x2a6036f763199bff expref: 7 pid 5373
Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar message
Dec 14 16:31:26 mds2 kernel: Lustre: 5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75@tcp
Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message
Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req@000001012551c450 x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0
Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar messages
Dec 14 16:33:06 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542@NET_0x20000c0a80b4b_UUID nid 192.168.11.75@tcp ns: mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote: 0x2a6036f763199c3e expref: 6 pid 5398
Dec 14 16:33:06 mds2 kernel: Lustre: 5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75@tcp
Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req@0000010027297800 x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0
....<snip>...
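(For reference when reading the logs, my own decoding and not from the
messages themselves: on Linux, errno 107 is ENOTCONN and errno 5 is EIO,
which matches the client losing its MDS connection and open() then
failing with an I/O error. A quick check:)

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* The kernel messages report negative errno values:
     * -107 from ldlm_enqueue, -5 from ldlm_cli_enqueue. */
    printf("-107 (%s) = %s\n", 107 == ENOTCONN ? "ENOTCONN" : "?",
           strerror(107));
    printf("-5   (%s) = %s\n", 5 == EIO ? "EIO" : "?",
           strerror(5));
    return 0;
}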
Thanks for any help/input,
/Per Lundqvist
--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se