[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

Rick Wagner rpwagner at sdsc.edu
Tue Jul 12 13:56:40 PDT 2011


On Jul 12, 2011, at 11:34 AM, Rick Wagner wrote:

> 
> On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
> 
>> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>> 
>>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>>> Hi,
>>>>> ......
>>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>>> 
>>>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>>> 
>>>> I'd suggest enabling console logging of network errors, by 'echo
>>>> +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages
>>>> should show up in 'dmesg' when you have LNET connectivity problems.
>>> 
>>> Thanks, Isaac, I have put that in place. We have neterror in the sysctl configuration as part of lnet.debug and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>>> 
>>> [root at lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
>>> warning error emerg console
>> 
>> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk'
>> output after 'echo +neterror > /proc/sys/lnet/printk', otherwise
>> it's a bug. This is different from lnet.debug.
> 
> I am, sorry if the previous post was misleading.
> [root at lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
> neterror warning error emerg console
> [root at lustre-oss-0-2 ~]# 
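
For anyone following the thread: these are two separate masks. sysctl lnet.debug (i.e. /proc/sys/lnet/debug) controls what goes into the Lustre debug log, while /proc/sys/lnet/printk controls what gets echoed to the console and hence dmesg. A quick way to inspect and set both, assuming the usual 1.8-era /proc and sysctl paths:

  sysctl lnet.debug                        # mask for the internal debug log
  cat /proc/sys/lnet/printk                # mask for console/dmesg output
  echo +neterror > /proc/sys/lnet/printk   # add neterror to the console mask, per Isaac's suggestion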

Isaac, I think your suggestion led to more information. I re-ran the client code and experienced the same problems, but this time there were additional messages, in particular one about "No usable routes":

  Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137 at tcp
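
To see how LNET on the server views that NID, these are the checks I know of (a sketch, assuming the 1.8-style /proc interface; the NID below is the failing client from the message above):

  lctl list_nids                  # NIDs this server is exporting
  cat /proc/sys/lnet/routes       # configured LNET routes
  cat /proc/sys/lnet/peers        # per-peer state and credits
  lctl ping XXX.XXX.118.137@tcp   # LNET-level ping of the client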

In our configuration, the cluster I'm running on uses TCP and should not have any traffic going over a router. There is a second test system which does have an LNET router configured, which is why the routes entry appears below. This is what we have for our LNET configuration on the servers:

  [root at lustre-oss-0-2 ~]# cat /etc/modprobe.d/lnet.conf 
  options lnet networks=tcp(bond0)  routes="o2ib XXX.XXX.81.18 at tcp"

For the servers, all of the traffic goes out the bond0 interface.
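
And since the routes parameter only declares a gateway for reaching the o2ib network, a tcp peer like this client should be contacted directly over bond0 and never involve the router. A rough way to confirm what the module actually parsed, and that the bond itself is healthy (again assuming the 1.8-era /proc files):

  cat /proc/sys/lnet/nis        # LNET interfaces that came up; the tcp NID should match bond0's address
  cat /proc/sys/lnet/routes     # should list only the o2ib route via XXX.XXX.81.18@tcp
  cat /proc/net/bonding/bond0   # link state of the bond and its slave interfaces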

The dmesg output is below. It shows "peer not alive" for the client that I saw fail. When I logged onto that client, it reported an I/O error when performing an LNET ping to the server whose messages I've posted, and then both sides recovered.
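
The -113s in the LNetPut errors below match the EHOSTUNREACH assumption from earlier in the thread, which is easy to double-check against the kernel headers. The only workaround we have found so far is the server-side LNET ping mentioned above. A sketch of both checks (the header path and /proc layout are what I'd expect on our nodes and may differ by distro or Lustre version):

  grep -Ew 'ENOTCONN|EHOSTUNREACH' /usr/include/asm-generic/errno.h   # should show 107 and 113
  grep XXX.XXX.118.137 /proc/sys/lnet/peers   # does the OSS still consider this peer alive?
  lctl ping XXX.XXX.118.137@tcp               # server-side LNET ping; once this succeeds, the client recovers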

Thanks again,
Rick

From dmesg on lustre-oss-0-2

Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137 at tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff8104d5950000
LustreError: 24502:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at ffff8101cbfb3000 x1373631114768318/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502335 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 24502:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137 at tcp - client will retry
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0023: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
Lustre: 24307:0:(ldlm_lib.c:803:target_handle_connect()) phase1-OST0023: exp ffff810292b50e00 already connecting
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req at ffff81033c17bc50 x1373631114768595/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502459 ref 1 fl Interpret:/0/0 rc -114/0
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 338 previous similar messages
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137 at tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff810226052000
LustreError: 24553:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(581632)  req at ffff81024b96b000 x1373631114821718/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502365 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24553:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137 at tcp - client will retry
Lustre: 24527:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137 at tcp: peer not alive
Lustre: 24527:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137 at tcp: -113
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137 at tcp: peer not alive
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137 at tcp: -113
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137 at tcp: peer not alive
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137 at tcp: -113
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at XXX.XXX.118.137 at tcp  ns: filter-phase1-OST0023_UUID lock: ffff810154ed5000/0x891870f2e8a00d6b lrc: 3/0,0 mode: PR/PR res: 6449/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x202026369bcdd0b6 expref: 17 pid: 18170 timeout 6508323681
LustreError: 24280:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at ffff8105ff468c00 x1372975847776577/t0 o105->@NET_0x20000c6ca7689_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 24280:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid XXX.XXX.118.137 at tcp) returned 0 from completion AST ns: filter-phase1-OST0023_UUID lock: ffff810233131e00/0x891870f2e8a00f4e lrc: 3/0,0 mode: PW/PW res: 6451/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x0 remote: 0x202026369bcdecc4 expref: 15 pid: 18171 timeout 0
LustreError: 24483:0:(ost_handler.c:1060:ost_brw_write()) @@@ Eviction on bulk GET  req at ffff81061cce1800 x1373631114821716/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 3 to 0 dl 1310502547 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137 at tcp - client will retry
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) Skipped 1 previous similar message
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137 at tcp: peer not alive
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137 at tcp: -113
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0021: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar messages
Lustre: Service thread pid 24433 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 24433, comm: ll_ost_io_00

Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: Service thread pid 24438 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
LustreError: dumping log to /tmp/lustre-log.1310502559.24433
Pid: 24438, comm: ll_ost_io_05

Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1310502560.24438
Lustre: 24406:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24406:0:(ldlm_lib.c:874:target_handle_connect()) phase1-OST0020: refuse reconnection from ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at XXX.XXX.118.137@tcp to 0xffff810496028400; still busy with 2 active RPCs
LustreError: 24406:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req at ffff81025c5a2050 x1373631114823077/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502660 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 24438:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET  req at ffff81061cce1400 x1373631114821715/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24438:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137 at tcp - client will retry
Lustre: Service thread pid 24438 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 24433:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET  req at ffff810242d00000 x1373631114821719/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b at NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: Service thread pid 24433 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: 24373:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting




