[Lustre-discuss] Lustre clients failing, and can't reconnect
Brock Palen
brockp at umich.edu
Thu Sep 4 19:58:34 PDT 2008
I am having clients lose their connection to the MDS. Messages on
the clients look like this:
Sep 4 19:51:30 nyx-login2 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc44e800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
Sep 4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
The client keeps doing this, trying to connect and spitting out
"mds_connect operation failed with -16". The clients never recover.
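For reference, the -16 in these messages looks like a negated Linux errno value. The snippet below is a minimal sketch for decoding it, assuming the client runs Linux and that the code follows the usual kernel convention of returning negative errno numbers. The helper name describe_rc is hypothetical and is not part of Lustre.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a negative kernel-style return code to its errno name and text."""
    code = abs(rc)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


# The code reported by the failing mds_connect operation above.
print(describe_rc(-16))
```

On Linux, errno 16 is EBUSY, which fits the "still busy" wording in the MDS log message.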
On the MDS, all I see is:
Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp to 0x000001037c109000; still busy with 3 active RPCs
I get this RPC message for many hosts.
Clients and servers are all using TCP.
Is this enough information?
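To see how many hosts are affected, the "refuse reconnection" messages on the MDS can be tallied per client NID. This is a rough sketch, assuming each message sits on a single log line in the format shown above. The function name count_refusals and the regular expression are illustrative and not part of any Lustre tool.

```python
import re
from collections import Counter

# Matches the client UUID, its NID, and the active RPC count.
# Accepts "@" or the archive's " at " between UUID and NID.
REFUSE_RE = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)(?:@| at )(?P<nid>\S+) "
    r"to \S+; still busy with (?P<rpcs>\d+) active RPCs"
)


def count_refusals(lines):
    """Count refused reconnection attempts per client NID."""
    counts = Counter()
    for line in lines:
        match = REFUSE_RE.search(line)
        if match:
            counts[match.group("nid")] += 1
    return counts


# Sample input: the single MDS message quoted in this post.
sample = [
    "Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) "
    "nobackup-MDT0000: refuse reconnection from "
    "618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp "
    "to 0x000001037c109000; still busy with 3 active RPCs",
]

for nid, n in count_refusals(sample).most_common():
    print(nid, n)
```

In practice the lines would come from the MDS syslog file rather than a hard-coded list.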
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985