[Lustre-discuss] Lustre clients failing, and can't reconnect

Brock Palen brockp at umich.edu
Thu Sep 4 19:58:34 PDT 2008


I am having clients lose their connection to the MDS.  Messages on  
the clients look like this:

Sep  4 19:51:30 nyx-login2 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc44e800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
Sep  4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16

The client keeps retrying the connection and spitting out "mds_connect
operation failed with -16" over and over.  The clients never recover.

On the MDS, all I see is:

Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp to 0x000001037c109000; still busy with 3 active RPCs

I get this same RPC message for many different client hosts.
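
To gauge how many clients are affected, a rough sketch like the one below could tally the "refuse reconnection" messages per client NID. It assumes each log message sits on a single unwrapped line and has the form "from <client UUID>@<NID> to ...", as in the MDS message quoted above. The function name count_refusals and the regular expression are illustrative only, and the message format may differ between Lustre versions.

```python
import re
from collections import Counter

# Pattern modelled on the single MDS message quoted above (an assumption,
# not an official log format).
REFUSE_RE = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)@(?P<nid>\S+) to \S+; "
    r"still busy with (?P<rpcs>\d+) active RPCs"
)


def count_refusals(lines):
    """Return a Counter mapping client NID -> number of refused reconnections."""
    counts = Counter()
    for line in lines:
        match = REFUSE_RE.search(line)
        if match:
            counts[match.group("nid")] += 1
    return counts


# The one message available from the MDS, joined onto a single line.
sample = [
    "Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: "
    "refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp "
    "to 0x000001037c109000; still busy with 3 active RPCs",
]

for nid, n in count_refusals(sample).most_common():
    print(nid, n)
```

In practice the lines would come from the MDS syslog file rather than a hard-coded list; the path to that file depends on the site's syslog setup.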

Clients and servers are all using TCP.

Is this enough information?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





