[Lustre-discuss] lock callback timer expired: evicting client

Per Lundqvist perl at nsc.liu.se
Fri Dec 14 08:41:27 PST 2007


Recently we upgraded all of our lustre clients from 1.4.11.1 to 1.6.3 
while keeping 1.4.11.1 on all the MDS/OSS servers (which will be upgraded 
later on). This has resulted in an error we had never observed before when 
running a specific application (ccsm3) on this lustre filesystem: opening 
the same file from a handful of compute nodes causes some or all of the 
corresponding lustre clients to be evicted and the open system call to 
fail.

When triggered, any process on any node that tries to stat the 
directory where this file is opened will also be evicted and freeze for 
100 s. (Output from /var/log/messages is included at the end of this 
mail; apologies for the length...)

I am interested in knowing whether this problem is only triggered when 
using 1.6.3 clients against 1.4.11.1 servers, or if it might also be 
triggered with more closely matched client/server versions. Is there any 
way to prevent this from happening without upgrading the servers or 
downgrading the clients? (I will later try to reproduce this using 
1.4.11.1 clients.)

Our setup:
   1 MDS + 3 OSS: Centos4 with 2.6.9-55.0.9.EL_lustre.1.4.11.1smp
   clients:       Centos4 with 2.6.9-55.0.9.EL_lustre.1.6.3smp
   (Lustre over Gigabit ethernet, no failover)

We have narrowed this down to 2 similar examples, both of which trigger 
this error:

one in fortran:
      implicit none
      integer i

      do i=1,1000
         open(unit=16,file='tmpfile')
         close(16)
      enddo
      stop
      end

and one in C:
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   
   int main(void) {
     int i, fd;
     char cwd[4096];
     char tmpfile[4096 + 16];    /* room for cwd plus "/tmpfile" */
   
     for (i = 0; i < 1000; i++) {
       if (getcwd(cwd, sizeof cwd) == NULL)
         exit(1);
       snprintf(tmpfile, sizeof tmpfile, "%s/tmpfile", cwd);
       fd = open(tmpfile, O_RDWR|O_CREAT, 0666);
       if (fd < 0) {             /* fails with EIO once the client is evicted */
         perror("open");
         exit(1);
       }
       close(fd);
     }
     return 0;
   }

We can trigger the error by executing one of the examples on at least 3 
nodes in parallel, i.e.:
   1. copy the binary to its own directory on the lustre filesystem.
   2. execute the binary in parallel on at least 3 nodes.

Example:
   gcc evict.c -o evict
   mkdir /lustrefs/tmpdir
   cp evict /lustrefs/tmpdir
   pdsh -f3 -w n[1-3] 'cd /lustrefs/tmpdir; ./evict'

/var/log/messages on evicted client:
   Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107
   Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations using this service will wait for recovery to complete.
   Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail.
   Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
   Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at 000001007dc87400 x854/t0 o101->misu2_mds_UUID at mds2_UUID:12 lens 392/728 ref 1 fl Rpc:/0/0 rc 0/0
   Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp.
   Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107
   Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations using this service will wait for recovery to complete.
   Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail.
   Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
   Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message
   Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp.
   ....<snip>..
   
/var/log/messages on MDS:
   Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp  ns: mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote: 0x2a6036f763199bff expref: 7 pid 5373
   Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar message
   Dec 14 16:31:26 mds2 kernel: Lustre: 5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
   Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp
   Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message
   Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)  req at 000001012551c450 x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0
   Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar messages
   Dec 14 16:33:06 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp  ns: mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote: 0x2a6036f763199c3e expref: 6 pid 5398
   Dec 14 16:33:06 mds2 kernel: Lustre: 5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
   Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp
   Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)  req at 0000010027297800 x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0
   ....<snip>...

Thanks for any help/input,
   
/Per Lundqvist

-- 
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se

