[Lustre-discuss] Lustre clients failing, and cant reconnect

Jerome, Ron Ron.Jerome at nrc-cnrc.gc.ca
Fri Sep 5 07:48:20 PDT 2008


For what it's worth...  I've seen similar problems with clients not
being able to connect to OSSs

SERVER OS: Linux oss1 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1 SMP Thu
Jun 26 01:38:50 EDT 2008 i686 i686 i386 GNU/Linux
CLIENT OS: Linux x15 2.6.18-53.1.14.el5_lustre.1.6.5smp #1 SMP Mon May
12 22:24:24 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


On the client side I see this...

====================================  CLIENT LOG
=================================================
Sep  1 15:17:22 x15 kernel: Lustre: Request x30990319 sent from
data-OST0004-osc-ffff81022067ec00 to NID 192.168.0.201 at tcp 100s ago has
timed out (limit
100s).
Sep  1 15:17:22 x15 kernel: Lustre: Skipped 9 previous similar messages
Sep  1 15:17:22 x15 kernel: Lustre: data-OST0004-osc-ffff81022067ec00:
Connection to service data-OST0004 via nid 192.168.0.201 at tcp was lost;
in progress
 operations using this service will wait for recovery to complete.
Sep  1 15:17:22 x15 kernel: LustreError:
3834:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel
RPC: canceling anyway
Sep  1 15:17:22 x15 kernel: LustreError:
3834:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11
Sep  1 15:17:22 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:17:22 x15 kernel: LustreError: Skipped 2 previous similar
messages
Sep  1 15:17:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 6s
Sep  1 15:17:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 4 previous
similar messages
Sep  1 15:17:47 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:18:37 x15 last message repeated 2 times
Sep  1 15:19:02 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:19:27 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 26s
Sep  1 15:19:27 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 3 previous
similar messages
Sep  1 15:19:27 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:20:17 x15 last message repeated 2 times
Sep  1 15:21:07 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:21:07 x15 kernel: LustreError: Skipped 1 previous similar
message
Sep  1 15:21:57 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:21:57 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 5 previous
similar messages
Sep  1 15:22:22 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:22:22 x15 kernel: LustreError: Skipped 2 previous similar
messages
Sep  1 15:24:52 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:24:52 x15 kernel: LustreError: Skipped 5 previous similar
messages
Sep  1 15:27:22 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:27:22 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 12 previous
similar messages
Sep  1 15:29:27 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:29:27 x15 kernel: LustreError: Skipped 10 previous similar
messages
Sep  1 15:37:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:37:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 24 previous
similar messages
Sep  1 15:38:12 x15 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.0.201 at tcp. The ost_connect operation failed
with -16
Sep  1 15:38:12 x15 kernel: LustreError: Skipped 20 previous similar
messages
Sep  1 15:48:12 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-ffff81022067ec00: tried all connections, increasing
 latency to 51s
====================================  END CLIENT LOG
=================================================


Server log at corresponding time...

====================================  SERVER LOG
=================================================
Aug 31 04:02:04 oss1 syslogd 1.4.1: restart.
Sep  1 15:15:42 oss1 kernel: LustreError:
2450:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource
!= NULL) failed
Sep  1 15:15:42 oss1 kernel: LustreError:
2450:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
Sep  1 15:15:42 oss1 kernel: Lustre:
2450:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for
process 2450
Sep  1 15:15:42 oss1 kernel: ldlm_cn_01    R running  5128  2450      1
2451  2448 (L-TLB)
Sep  1 15:15:42 oss1 kernel:        f8df6614 48bc3fd2 c042dabe 00000082
e8addfc0 e1f77fcc 00000001 e8ad5ef4
Sep  1 15:15:42 oss1 kernel:        48bc3f5e 00059c88 e77907c0 23c34600
e737d2f4 f8d73b5c cb9f7000 e47d8000
Sep  1 15:15:42 oss1 kernel:        48bc3f3c c041f851 00000000 00000000
48bc3f5e 00059c88 00000080 e737d298
Sep  1 15:15:42 oss1 kernel: Call Trace:
Sep  1 15:15:42 oss1 kernel:  [<f8df6614>]
lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep  1 15:15:42 oss1 kernel:  [<c042dabe>] lock_timer_base+0x15/0x2f
Sep  1 15:15:42 oss1 kernel:  [<f8d73b5c>]
lc_watchdog_touch_ms+0xac/0x270 [libcfs]
Sep  1 15:15:42 oss1 kernel:  [<c041f851>] __wake_up+0x2a/0x3d
Sep  1 15:15:42 oss1 kernel:  [<f8e0dffe>] ptlrpc_main+0xabe/0x16d0
[ptlrpc]
Sep  1 15:15:42 oss1 kernel:  [<c0420291>] default_wake_function+0x0/0xc
Sep  1 15:15:42 oss1 kernel:  [<f8e0d540>] ptlrpc_main+0x0/0x16d0
[ptlrpc]
Sep  1 15:15:42 oss1 kernel:  [<c0405c0f>] kernel_thread_helper+0x7/0x10
Sep  1 15:15:42 oss1 kernel:  =======================
Sep  1 15:15:42 oss1 kernel: LustreError: dumping log to
/tmp/lustre-log.1220296542.2450
Sep  1 15:17:22 oss1 kernel: Lustre:
2500:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:17:47 oss1 kernel: Lustre:
2484:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:18:12 oss1 kernel: Lustre:
2481:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:18:37 oss1 kernel: Lustre:
2651:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:19:02 oss1 kernel: Lustre:
2481:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:19:27 oss1 kernel: Lustre:
2672:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:19:52 oss1 kernel: Lustre:
2677:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:20:42 oss1 kernel: Lustre:
2489:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:20:42 oss1 kernel: Lustre:
2489:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 1 previous
similar message
Sep  1 15:21:57 oss1 kernel: Lustre:
2500:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:21:57 oss1 kernel: Lustre:
2500:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 2 previous
similar messages
Sep  1 15:24:27 oss1 kernel: Lustre:
2720:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:24:27 oss1 kernel: Lustre:
2720:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 5 previous
similar messages
Sep  1 15:25:42 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb())
Watchdog triggered for pid 2450: it was inactive for
600s
Sep  1 15:25:42 oss1 kernel: Lustre:
0:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for
process 2450
Sep  1 15:25:42 oss1 kernel: ldlm_cn_01    D 5853A7AE  5128  2450      1
2451  2448 (L-TLB)
Sep  1 15:25:42 oss1 kernel:        e8ad5ce0 00000046 000001b0 5853a7ae
000990b5 00000000 0000000a f5cd5550
Sep  1 15:25:42 oss1 kernel:        f7c1c550 5853d51e 000990b5 00002d70
00000001 f5cd565c c20147e0 00000000
Sep  1 15:25:42 oss1 kernel:        e8ad5eb4 f8d830a0 f8d7615c ffffffff
00000000 00000000 00000000 f8d75fa0
Sep  1 15:25:42 oss1 kernel: Call Trace:
Sep  1 15:25:42 oss1 kernel:  [<f8d67d8c>] lbug_with_loc+0x9c/0xc0
[libcfs]
Sep  1 15:25:42 oss1 kernel:  [<f8d7287f>]
libcfs_assertion_failed+0x5f/0x60 [libcfs]
Sep  1 15:25:42 oss1 kernel:  [<f8d8e3cb>]
__ldlm_handle2lock+0x4db/0x6b0 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8df815e>]
lustre_msg_get_flags+0x4e/0x130 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8defe79>] ptlrpc_send_reply+0x299/0x840
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8dc3ae1>]
ldlm_request_cancel+0x131/0x7e0 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8dfd53f>] lustre_pack_reply+0x2f/0x140
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8dc423d>] ldlm_handle_cancel+0xad/0x570
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8df7854>] lustre_msg_get_opc+0xc4/0x210
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8df6614>]
lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8dc487e>]
ldlm_cancel_handler+0x17e/0xb80 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8e06305>] ptlrpc_check_req+0x15/0x220
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<f8df7e9e>]
lustre_msg_get_handle+0xce/0x210 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<c042ccfd>] do_gettimeofday+0x31/0xce
Sep  1 15:25:42 oss1 kernel:  [<f8e09fdd>]
ptlrpc_server_handle_request+0xf3d/0x1d90 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<c042ccfd>] do_gettimeofday+0x31/0xce
Sep  1 15:25:42 oss1 kernel:  [<f8d730bc>] lcw_update_time+0x1c/0x170
[libcfs]
Sep  1 15:25:42 oss1 kernel:  [<f8d730bc>] lcw_update_time+0x1c/0x170
[libcfs]
Sep  1 15:25:42 oss1 kernel:  [<f8df6614>]
lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<c042dabe>] lock_timer_base+0x15/0x2f
Sep  1 15:25:42 oss1 kernel:  [<f8d73b5c>]
lc_watchdog_touch_ms+0xac/0x270 [libcfs]
Sep  1 15:25:42 oss1 kernel:  [<c041f851>] __wake_up+0x2a/0x3d
Sep  1 15:25:42 oss1 kernel:  [<f8e0dffe>] ptlrpc_main+0xabe/0x16d0
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<c0420291>] default_wake_function+0x0/0xc
Sep  1 15:25:42 oss1 kernel:  [<f8e0d540>] ptlrpc_main+0x0/0x16d0
[ptlrpc]
Sep  1 15:25:42 oss1 kernel:  [<c0405c0f>] kernel_thread_helper+0x7/0x10
Sep  1 15:25:42 oss1 kernel:  =======================
Sep  1 15:25:42 oss1 kernel: LustreError: dumping log to
/tmp/lustre-log.1220297142.2450
Sep  1 15:29:02 oss1 kernel: Lustre:
2654:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:29:02 oss1 kernel: Lustre:
2654:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 10 previous
similar messages
Sep  1 15:37:47 oss1 kernel: Lustre:
2703:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:37:47 oss1 kernel: Lustre:
2703:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 20 previous
similar messages
Sep  1 15:48:12 oss1 kernel: Lustre:
2699:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse
reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp
to 0xcb9f7000; still busy with 2 active RPCs
Sep  1 15:48:12 oss1 kernel: Lustre:
2699:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 24 previous
similar messages
====================================  END SERVER LOG
=================================================

_________________________________________
Ron Jerome
National Research Council Canada
_________________________________________

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-
> bounces at lists.lustre.org] On Behalf Of Brock Palen
> Sent: September 5, 2008 10:22 AM
> To: Brian J. Murrell
> Cc: Lustre Discuss
> Subject: Re: [Lustre-discuss] Lustre clients failing, and cant
> reconnect
> 
> I had to reboot the MDS to get the problem to go away.
> I will watch and see if it reappears. I screwed up and deleted the
> wrong /var/log/messages  So I don't have the messages.
> 
> I am watching this issues.
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
> 
> 
> 
> On Sep 5, 2008, at 10:01 AM, Brian J. Murrell wrote:
> > On Fri, 2008-09-05 at 00:15 -0400, Brock Palen wrote:
> >> Looks like that didn't fix it.  One of the login nodes repeated the
> >> behavior.
> >
> > So what are the messages the client logged when the problem
occurred?
> > And what, if anything was logged on the MDS at the same time?
> >
> > b.
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



More information about the lustre-discuss mailing list