[Lustre-discuss] lustre client network stack hanging
Derek Yarnell
derek at umiacs.umd.edu
Thu Oct 22 22:09:28 PDT 2009
I have been trying to find out whether anyone else has reported or
seen something similar. I would be happy to file a bug report, but
I searched Bugzilla for a bit and did not find much. The
weirdest thing is that the MDS/OSS servers are fine, but the client's
whole network stack gets screwed up. It even stops responding to pings,
and it seems very odd that Lustre could cause problems to this extent.
Has anyone heard of or seen anything like this? Attached are the syslogs
from the client, covering the time its network stack hung, and from the
MDS/MGS.
Note: the client, cfd-mds-01, is not running any MDS/MGT services; it is
just a patchless client for now.
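For anyone who wants to check the same thing on their own client: below is a minimal sketch of a plain TCP reachability test against the acceptor port (988, as it appears in the client log below). The function name probe_tcp and the 5 second timeout are my own choices for illustration. It only attempts a TCP handshake; it does not speak LNet, so a successful connect says nothing about Lustre itself.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connect to host:port and return (ok, detail)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return True, "connected"
    except socket.timeout:
        # No answer within the timeout (compare Error -110 in the log).
        return False, "timed out"
    except OSError as exc:
        # Report the symbolic errno name, e.g. ECONNREFUSED or EHOSTUNREACH.
        return False, errno.errorcode.get(exc.errno, "unknown")
    finally:
        sock.close()
```

Calling probe_tcp("192.168.14.20", 988) from the client while it is in the hung state, and again from another host, would at least show whether the problem is limited to Lustre traffic or affects ordinary sockets too.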
Client (lustre-client-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)
Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(socklnd.c:
1661:ksocknal_destroy_conn()) Completing partial receive from
12345-192.168.14.23 at tcp, ip 192.168.14.23:1022, with error
Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(events.c:
189:client_bulk_callback()) event type 1, status -5, desc
ffff8100c7672000
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1020 ->
192.168.14.20/988
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) @@@ Request x1317187295870120 sent
from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 2997s
ago has timed out (limit 1344s).
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc-
ffff8103216b8800: Connection to service cfd-OST0003 via nid
192.168.14.23 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
Oct 22 13:24:13 cfd-mds-01 kernel: LustreError: 11-0: an error
occurred while communicating with 192.168.14.23 at tcp. The ost_connect
operation failed with -16
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: 4686:0:(import.c:
508:import_select_connection()) cfd-OST0003-osc-ffff8103216b8800:
tried all connections, increasing latency to 6s
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc-
ffff8103216b8800: Connection restored to service cfd-OST0003 using nid
192.168.14.23 at tcp.
Oct 22 13:26:50 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186310 sent
from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 7s ago has timed
out (limit 7s).
Oct 22 13:26:50 cfd-mds-01 kernel: LustreError: 166-1:
MGC192.168.14.20 at tcp: Connection to service MGS via nid
192.168.14.20 at tcp was lost; in progress operations using this service
will fail.
Oct 22 13:26:56 cfd-mds-01 kernel: Lustre: 4685:0:(client.c:
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186311 sent
from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 6s ago has timed
out (limit 6s).
Oct 22 13:26:57 cfd-mds-01 kernel: Lustre: cfd-MDT0000-mdc-
ffff8103216b8800: Connection to service cfd-MDT0000 via nid
192.168.14.20 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c:
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->
192.168.14.20/988
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c:
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at
host 192.168.14.20 on port 988 took too long: that node may be hung or
experiencing high load.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186316 sent
from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 12s ago
has timed out (limit 12s).
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc-
ffff8103216b8800: Connection to service cfd-OST0003 via nid
192.168.14.23 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: Skipped 3 previous similar
messages
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186309 sent
from cfd-OST0001-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 44s ago
has timed out (limit 44s).
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c:
1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:12 cfd-mds-01 kernel: Lustre: 4682:0:(socklnd_cb.c:
2173:ksocknal_find_timed_out_conn()) A connection with
12345-192.168.14.22 at tcp (192.168.14.22:988) timed out; the network or
node may be down.
MDS (lustre-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)
Oct 22 12:34:50 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c:
2173:ksocknal_find_timed_out_conn()) A connection with
12345-192.168.14.21 at tcp (192.168.14.21:1021) timed out; the network or
node may be down.
Oct 22 13:27:14 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c:
915:ksocknal_launch_packet()) No usable routes to
12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c:
915:ksocknal_launch_packet()) No usable routes to
12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c:
2181:ksocknal_find_timed_out_conn()) An unexpected network error 113
occurred with 12345-192.168.14.21 at tcp (192.168.14.21:1022
Oct 22 13:30:11 cfd-mds-00 kernel: Lustre: cfd-MDT0000: haven't heard
from client 2d7ea85b-2184-0e60-e96f-fe2cd01a4b3e (at
192.168.14.21 at tcp) in 227 seconds. I think it's dead, and I am
evicting it.
Oct 22 13:30:30 cfd-mds-00 kernel: Lustre: MGS: haven't heard from
client 58fe30f2-259f-a304-9f56-696ec03db7c0 (at 192.168.14.21 at tcp) in
227 seconds. I think it's dead, and I am evicting it.
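The numeric codes in the two logs above look like standard Linux errno values (negated on the kernel side). As a reading aid, here is a small sketch that pulls them out of a log line and maps them to their names. The table is filled in by hand with only the four values that occur above, and the regular expression is an assumption based on the phrasings in these particular messages, not a general Lustre log parser.

```python
import re

# Linux errno values that occur in the logs above (hand-filled table).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches "Error -110", "status -5", "failed with -16", "network error 113".
CODE_RE = re.compile(r"(?:Error|status|failed with|network error) (-?\d+)")


def decode(line):
    """Return (code, name, text) for each errno-style code found in line."""
    found = []
    for match in CODE_RE.finditer(line):
        code = abs(int(match.group(1)))  # kernel messages print -errno
        name, text = LINUX_ERRNO.get(code, ("?", "not in table"))
        found.append((code, name, text))
    return found
```

Read this way, the client side is mostly connect timeouts to 192.168.14.20 (ETIMEDOUT), while the MDS eventually sees EHOSTUNREACH for the client at 192.168.14.21. What EBUSY means for ost_connect specifically is not something the errno table can answer.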
Derek Yarnell
UNIX Systems Administrator
University of Maryland
Institute for Advanced Computer Studies