[Lustre-discuss] login nodes still hang with 1.6.6

Brock Palen brockp@umich.edu
Sun Dec 7 06:09:06 PST 2008


Hello,
We upgraded our clients to 1.6.6 while the servers are still on 1.6.5. We are
still seeing the login nodes, much more often than the compute nodes, get
evicted and then never manage to recover:

LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
LustreError: 15616:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Lustre: Request x960964 sent from nobackup-MDT0000-mdc-00000100f7eb0800 to NID 10.164.3.247@tcp 100s ago has timed out (limit 100s).
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 23549:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@000001008d7e0400 x961041/t0 o35->nobackup-MDT0000_UUID@10.164.3.246@tcp:23/10 lens 296/1248 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) Skipped 691 previous similar messages
LustreError: 19442:0:(file.c:116:ll_close_inode_openhandle()) inode 93290605 mdc close failed: rc = -108
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
LustreError: 23549:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded.  Please tell <http://bugzilla.lustre.org/>.
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: 3879:0:(import.c:410:import_select_connection()) nobackup-MDT0000-mdc-00000100f7eb0800: tried all connections, increasing latency to 36s
Lustre: 3879:0:(import.c:410:import_select_connection()) Skipped 2 previous similar messages
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: Changing connection for nobackup-MDT0000-mdc-00000100f7eb0800 to 10.164.3.246@tcp/10.164.3.246@tcp


Is this the same bug?  The compute nodes look mostly OK, but the above still
happens on the login nodes every few days.  I don't see any mention of
statahead in these messages, but should I go ahead and set it to 0 again?
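
For context, when I set it to 0 before it was done on the client side,
roughly like this (parameter path is from memory for 1.6.x, so please
double-check it on your version before copying):

    lctl set_param llite.*.statahead_max=0
    # equivalently, through procfs on the client:
    echo 0 > /proc/fs/lustre/llite/*/statahead_max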


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985