[Lustre-discuss] lustre client network stack hanging

Derek Yarnell derek at umiacs.umd.edu
Thu Oct 22 22:09:28 PDT 2009


I have been trying to find out whether anyone else has reported or
found something similar.  I would be happy to file a bug report, but
I searched Bugzilla for a while and did not find much.  The weirdest
part is that the MDS/OSS servers are fine, but the client's whole
network stack gets wedged.  The client even stops responding to pings,
and it seems very odd for Lustre to cause problems to this extent.

Has anyone heard of or seen anything like this?  Below are the syslogs
from the client, from when its network stack hung, and from the MDS/MGS.

Note: the client cfd-mds-01 is not running any MDS/MGT services; it is
just a patchless client for now.

Client (lustre-client-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)

Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(socklnd.c: 
1661:ksocknal_destroy_conn()) Completing partial receive from  
12345-192.168.14.23 at tcp, ip 192.168.14.23:1022, with error
Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(events.c: 
189:client_bulk_callback()) event type 1, status -5, desc  
ffff8100c7672000
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1020 ->  
192.168.14.20/988
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) @@@ Request x1317187295870120 sent  
from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 2997s  
ago has timed out (limit 1344s).
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- 
ffff8103216b8800: Connection to service cfd-OST0003 via nid  
192.168.14.23 at tcp was lost; in progress operations using this service  
will wait for recovery to complete.
Oct 22 13:24:13 cfd-mds-01 kernel: LustreError: 11-0: an error  
occurred while communicating with 192.168.14.23 at tcp. The ost_connect  
operation failed with -16
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: 4686:0:(import.c: 
508:import_select_connection()) cfd-OST0003-osc-ffff8103216b8800:  
tried all connections, increasing latency to 6s
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- 
ffff8103216b8800: Connection restored to service cfd-OST0003 using nid  
192.168.14.23 at tcp.
Oct 22 13:26:50 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186310 sent  
from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 7s ago has timed  
out (limit 7s).
Oct 22 13:26:50 cfd-mds-01 kernel: LustreError: 166-1:  
MGC192.168.14.20 at tcp: Connection to service MGS via nid  
192.168.14.20 at tcp was lost; in progress operations using this service  
will fail.
Oct 22 13:26:56 cfd-mds-01 kernel: Lustre: 4685:0:(client.c: 
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186311 sent  
from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 6s ago has timed  
out (limit 6s).
Oct 22 13:26:57 cfd-mds-01 kernel: Lustre: cfd-MDT0000-mdc- 
ffff8103216b8800: Connection to service cfd-MDT0000 via nid  
192.168.14.20 at tcp was lost; in progress operations using this service  
will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186316 sent  
from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 12s ago  
has timed out (limit 12s).
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- 
ffff8103216b8800: Connection to service cfd-OST0003 via nid  
192.168.14.23 at tcp was lost; in progress operations using this service  
will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: Skipped 3 previous similar  
messages
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186309 sent  
from cfd-OST0001-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 44s ago  
has timed out (limit 44s).
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 
1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:12 cfd-mds-01 kernel: Lustre: 4682:0:(socklnd_cb.c: 
2173:ksocknal_find_timed_out_conn()) A connection with  
12345-192.168.14.22 at tcp (192.168.14.22:988) timed out; the network or  
node may be down.
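
The syslog excerpts in this message were wrapped by the mail client, which makes them awkward to grep. A minimal sketch for re-joining them into one record per syslog entry, assuming every entry starts with a "Mon DD HH:MM:SS " timestamp. The `unwrap` helper and its `restore_nids` option are hypothetical, and turning " at tcp" back into "@tcp" assumes the list archive rewrote the "@" in the NIDs. Continuation lines are joined with a single space, so a token that was broken mid-way (such as "socklnd.c: 1661") keeps a stray space.

```python
import re

# A syslog entry starts with e.g. "Oct 22 12:35:11 " (day may be space-padded).
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")


def unwrap(lines, restore_nids=True):
    """Join mail-wrapped syslog lines into one string per log entry."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        if TIMESTAMP.match(line) or not records:
            records.append(line)          # start of a new entry
        else:
            records[-1] += " " + line     # continuation of the previous entry
    if restore_nids:
        # Assumption: " at tcp" is the archive's rendering of "@tcp".
        records = [r.replace(" at tcp", "@tcp") for r in records]
    return records


if __name__ == "__main__":
    sample = [
        "Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: ",
        "102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at  ",
        "host 192.168.14.20 on port 988 took too long: that node may be hung or  ",
        "experiencing high load.",
    ]
    for record in unwrap(sample):
        print(record)
```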

MDS (lustre-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)

Oct 22 12:34:50 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 
2173:ksocknal_find_timed_out_conn()) A connection with  
12345-192.168.14.21 at tcp (192.168.14.21:1021) timed out; the network or  
node may be down.
Oct 22 13:27:14 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 
915:ksocknal_launch_packet()) No usable routes to  
12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 
915:ksocknal_launch_packet()) No usable routes to  
12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 
2181:ksocknal_find_timed_out_conn()) An unexpected network error 113  
occurred with 12345-192.168.14.21 at tcp (192.168.14.21:1022
Oct 22 13:30:11 cfd-mds-00 kernel: Lustre: cfd-MDT0000: haven't heard  
from client 2d7ea85b-2184-0e60-e96f-fe2cd01a4b3e (at  
192.168.14.21 at tcp) in 227 seconds. I think it's dead, and I am  
evicting it.
Oct 22 13:30:30 cfd-mds-00 kernel: Lustre: MGS: haven't heard from  
client 58fe30f2-259f-a304-9f56-696ec03db7c0 (at 192.168.14.21 at tcp) in  
227 seconds. I think it's dead, and I am evicting it.
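
The numeric codes in the logs above (-110, -5, -16, 113) look like Linux errno values, which the kernel usually reports as negatives. A minimal sketch for decoding them, assuming the standard Linux numbering. The lookup table, the regex and the `explain` helper are illustrative only and are not part of Lustre.

```python
import re

# Linux errno numbers that appear in the logs above, with the usual
# strerror() text. Hardcoded because other operating systems number
# these differently.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches "Error -110", "status -5", "failed with -16", "network error 113".
CODE_RE = re.compile(r"(?:Error|status|failed with|network error)\s+-?(\d+)")


def explain(line):
    """Return (code, name, text) for the first error code in a log line."""
    match = CODE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in the table above"))
    return code, name, text


if __name__ == "__main__":
    for sample in (
        "libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1020 ->",
        "client_bulk_callback()) event type 1, status -5, desc",
        "operation failed with -16",
        "An unexpected network error 113 occurred with",
    ):
        print(explain(sample))
```

Read this way, the client is timing out (ETIMEDOUT) on connects to 192.168.14.20 port 988, while the MDS sees the client at 192.168.14.21 as unreachable (EHOSTUNREACH).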


Derek Yarnell
UNIX Systems Administrator
University of Maryland
Institute for Advanced Computer Studies





