[Lustre-discuss] login nodes still hang with 1.6.6
Brock Palen
brockp at umich.edu
Sun Dec 7 06:09:06 PST 2008
Hello,
We upgraded our clients to 1.6.6 while the servers are still on 1.6.5, and we are still seeing the login nodes (much more than the compute nodes) get evicted and then never manage to recover:
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
LustreError: 15616:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Lustre: Request x960964 sent from nobackup-MDT0000-mdc-00000100f7eb0800 to NID 10.164.3.247@tcp 100s ago has timed out (limit 100s).
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 23549:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001008d7e0400 x961041/t0 o35->nobackup-MDT0000_UUID@10.164.3.246@tcp:23/10 lens 296/1248 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) Skipped 691 previous similar messages
LustreError: 19442:0:(file.c:116:ll_close_inode_openhandle()) inode 93290605 mdc close failed: rc = -108
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
LustreError: 23549:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: 3879:0:(import.c:410:import_select_connection()) nobackup-MDT0000-mdc-00000100f7eb0800: tried all connections, increasing latency to 36s
Lustre: 3879:0:(import.c:410:import_select_connection()) Skipped 2 previous similar messages
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: Changing connection for nobackup-MDT0000-mdc-00000100f7eb0800 to 10.164.3.246@tcp/10.164.3.246@tcp
Is this the same bug? The compute nodes look mostly OK, but the above still happens on the login nodes every few days. I don't notice any mention of statahead in the logs, but should I go ahead and set it to 0 again?
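In case it helps anyone following along, this is roughly how we disabled statahead last time; a minimal sketch assuming the standard llite proc tunable (`statahead_max`) on a 1.6 client, with the filesystem instance name matched by a glob since it varies per mount:

```shell
#!/bin/sh
# Disable directory statahead on a Lustre client by writing 0 to the
# per-filesystem llite tunable. Run as root on each affected login node.
# The glob matches every mounted Lustre instance; adjust it to target
# just one filesystem (e.g. nobackup-*) if needed.
for f in /proc/fs/lustre/llite/*/statahead_max; do
    [ -e "$f" ] || continue   # skip if no Lustre filesystem is mounted
    echo 0 > "$f"             # 0 turns statahead off entirely
    echo "statahead disabled via $f"
done
```

Note this only lasts until remount, so we had to reapply it from an init script.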
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985