[Lustre-discuss] Clients fail every now and again,
Brock Palen
brockp@umich.edu
Fri Nov 14 15:37:19 PST 2008
We consistently see random occurrences of a client being kicked out, and while Lustre says it is trying to reconnect, it almost never manages to without a reboot:
Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110 waiting for callback (3 != 0)
Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) @@@ still on sending list req@000001015dd9ec00 x979024/t0 o101->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 448/1184 e 0 to 100 dl 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_statfs operation failed with -107
Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001000990fe00 x983192/t0 o41->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 128/400 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -108
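For reference, the rc values in those messages are negated errno codes, which makes the sequence easier to read: -110 is ETIMEDOUT, -4 EINTR, -107 ENOTCONN, -5 EIO, -108 ESHUTDOWN. A quick way to decode them (plain Python, nothing Lustre-specific):

    python -c 'import os; print "\n".join("-%d = %s" % (n, os.strerror(n)) for n in (110, 4, 107, 5, 108))'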
Is there any way to make Lustre more robust against these kinds of failures? According to the manual (and many times in practice, e.g. when rebooting an MDS) the filesystem should just block and then come back. In this case it almost never comes back: after a while the client will say it has reconnected, but it fails again right away.
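One knob that looks relevant (a sketch, assuming 1.6-style /proc paths; the exact location may differ on other versions): the eviction window appears to be derived from the global obd timeout, so raising it should ride out short stalls at the cost of slower failure detection.

    # current obd timeout in seconds (the "227 seconds" below scales with this):
    cat /proc/sys/lustre/timeout
    # raise it, e.g. to 300 s, on clients and servers alike:
    echo 300 > /proc/sys/lustre/timeout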
On the MDS I see:
Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-141.212.31.43@tcp
Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@00000103f84eae00 x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.
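In case it helps anyone looking at this: since an eviction followed by failed reconnects can be a one-way network problem, lctl ping is a quick way to rule out the path in both directions (nids taken from the logs above):

    # from the client, ping the MDS nid:
    lctl ping 10.164.3.246@tcp
    # from the MDS, ping the evicted client's nid:
    lctl ping 141.212.31.43@tcp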
The MDS just keeps evicting it, even though /proc/fs/lustre/health_check reports healthy on both the client and the servers.
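For completeness, "healthy" here means the proc file on each node, checked with something like the following (pdsh is just one way to fan this out; hostnames are the ones from the logs above):

    pdsh -w nyx-login1,mds1 cat /proc/fs/lustre/health_check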
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985