[Lustre-discuss] Intermittent routing errors

Deon Borman deon at blackginger.tv
Fri Jan 29 07:00:10 PST 2010


Hi all,

I have a weird problem on one of my OSSs, though I've seen it once on 
the other OSS. Things will be humming along nicely, when suddenly I get 
lots of messages like this:

Jan 29 15:26:16 venus kernel: Lustre: 
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
Jan 29 15:26:16 venus kernel: Lustre: 
1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp

In the 50-odd minutes before I picked it up, it produced over 10 million 
such lines in /var/log/messages. Performance degrades progressively 
during this time on all clients. On the client node in question, I/O is 
disrupted until I unmount and remount the OSTs. Then the problem goes 
away for a week or so. The part of the logs where this error starts is 
at the end of this mail. Oh, and I can ping/ssh the client in question 
from the server in question while the problem is occurring, so it 
doesn't seem to be a general networking problem.

A bit of info regarding my setup, in case it has something to do with 
this: I have a shared MDS/MGS and two OSSs, all on Dell servers. The OSS 
that's giving me the most headaches has 8GB RAM, 4 x 4TB OSTs and an 
Intel 10Gb NIC. The other OSS has 4GB RAM, 1 x 4.8TB OST and two Intel 
gigabit NICs that are bonded using the 802.3ad dynamic link aggregation 
protocol. I have about 200 clients connecting to this file system. I 
have another Lustre system, built from Intel-based servers, that acts 
as a mirror. That system has been running fine. All the servers are 
running CentOS 5.4 64-bit and Lustre 1.8.1.1. The clients are running 
the SUSE 11 Lustre kernel.

So, does anybody know what's going on here? Or have any pointers as to 
how I can debug this?

Any and all help appreciated,
Deon

/var/log/messages just before the flood starts:

Jan 29 14:28:54 venus kernel: Lustre: 
4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection 
with 12345-192.168.0.99 at tcp (192.168.0.99:1023) timed out; the network 
or node may be down.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0000: haven't heard from 
client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0.99 at tcp) in 227 
seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0003: haven't heard from 
client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0.99 at tcp) in 227 
seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: Skipped 1 previous similar message
Jan 29 14:38:45 venus kernel: Lustre: 
4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection 
with 12345-192.168.1.26 at tcp (192.168.1.26:1023) timed out; the network 
or node may be down.
Jan 29 14:40:50 venus kernel: Lustre: 
4251:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:50 venus kernel: Lustre: 
4251:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 
192.168.1.26 at tcp at host 192.168.1.26 was unreachable: the network or 
that node may be down,
 or Lustre may be misconfigured.
Jan 29 14:40:50 venus kernel: Lustre: 
4251:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:40:54 venus kernel: Lustre: 
1090:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request 
x1325271916821767 sent from galaxy-OST0001 to NID 192.168.1.26 at tcp 7s 
ago has timed out (limit 7s).
Jan 29 14:40:54 venus kernel:   req at ffff81016c500000 
x1325271916821767/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768854 
ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:57 venus kernel: Lustre: 
4253:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:57 venus kernel: Lustre: 
4253:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 
192.168.1.26 at tcp at host 192.168.1.26 was unreachable: the network or 
that node may be down,
 or Lustre may be misconfigured.
Jan 29 14:40:57 venus kernel: Lustre: 
4253:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:40:59 venus kernel: Lustre: 
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
Jan 29 14:40:59 venus kernel: LustreError: 
898:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
req at ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 
to 1 dl 1264768866 ref 2 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: LustreError: 
898:0:(events.c:66:request_out_callback()) Skipped 3929552 previous 
similar messages
Jan 29 14:40:59 venus kernel: Lustre: 
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request 
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26 at tcp 0s 
ago has failed due to network error (limit 7s).
Jan 29 14:40:59 venus kernel:   req at ffff8102b94ff000 
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866 
ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: Lustre: 
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
Jan 29 14:40:59 venus last message repeated 648 times
Jan 29 14:41:02 venus kernel: Lustre: 
4252:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:02 venus kernel: Lustre: 
4252:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 
192.168.1.26 at tcp at host 192.168.1.26 was unreachable: the network or 
that node may be down,
 or Lustre may be misconfigured.
Jan 29 14:41:02 venus kernel: Lustre: 
4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:41:02 venus kernel: Lustre: 
4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:41:06 venus kernel: Lustre: 
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request 
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26 at tcp 7s 
ago has timed out (limit 7s).
Jan 29 14:41:06 venus kernel:   req at ffff8102b94ff000 
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866 
ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:06 venus kernel: Lustre: 
898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 650 previous 
similar messages
Jan 29 14:41:09 venus kernel: Lustre: 
4250:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:09 venus kernel: Lustre: 
4250:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 
192.168.1.26 at tcp at host 192.168.1.26 was unreachable: the network or 
that node may be down,
 or Lustre may be misconfigured.
Jan 29 14:41:09 venus kernel: Lustre: 
4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:41:09 venus kernel: Lustre: 
4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 
len 296 192.168.0.16 at tcp->192.168.1.26 at tcp
Jan 29 14:41:13 venus kernel: Lustre: 
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request 
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26 at tcp 7s 
ago has timed out (limit 7s).
Jan 29 14:41:13 venus kernel:   req at ffff8102b94ff000 
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768873 
ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:13 venus kernel: Lustre: 
898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 1 previous 
similar message
Jan 29 14:41:13 venus kernel: Lustre: 
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
Jan 29 14:41:15 venus last message repeated 84370 times
Jan 29 14:41:15 venus kernel: Lustre: 
1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
Jan 29 14:41:15 venus kernel: Lustre: 
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 
12345-192.168.1.26 at tcp
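
To size up a flood like this without paging through millions of lines, 
the "No usable routes" messages can be tallied per NID. The following is 
only a minimal sketch in Python, not a Lustre tool: the function names 
and the SAMPLE text are made up for illustration, and it assumes the 
syslog format shown above, where "last message repeated N times" refers 
to the record immediately before it. It also re-joins lines that were 
wrapped by the mailer.

```python
import re
from collections import Counter

# NID in a "No usable routes" message; accepts "NID at tcp" (as archived
# here) as well as the usual "NID@tcp" form.
NO_ROUTE = re.compile(r"No usable routes to\s+(\S+?)(?:\s+at\s+|@)tcp")
REPEATED = re.compile(r"last message repeated (\d+) times")
# A new syslog record starts with a timestamp such as "Jan 29 14:41:15 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")


def join_records(lines):
    """Glue wrapped continuation lines back onto their syslog record."""
    record = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            if record:
                yield record
            record = line
        else:
            record += " " + line
    if record:
        yield record


def count_no_route(lines):
    """Count "No usable routes" messages per NID, including repeats."""
    counts = Counter()
    last_nid = None  # NID of the previous record, if it was a no-route message
    for record in join_records(lines):
        match = NO_ROUTE.search(record)
        if match:
            last_nid = match.group(1)
            counts[last_nid] += 1
            continue
        repeat = REPEATED.search(record)
        if repeat and last_nid is not None:
            counts[last_nid] += int(repeat.group(1))
        else:
            last_nid = None
    return counts


# Hypothetical sample, copied from the tail of the excerpt above.
SAMPLE = """
Jan 29 14:41:13 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26 at tcp
Jan 29 14:41:15 venus last message repeated 84370 times
Jan 29 14:41:15 venus kernel: Lustre:
1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26 at tcp
Jan 29 14:41:15 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26 at tcp
"""

if __name__ == "__main__":
    for nid, total in count_no_route(SAMPLE.splitlines()).items():
        print(nid, total)
```

To run it against a real log, pass an open file instead of the sample, 
for example count_no_route(open("/var/log/messages")).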

-- 
Deon Borman
IT Supervisor
BlackGinger
--



