[Lustre-discuss] Lustre I/O server (OSS) goes down, another I/O server takes over the service, overloads itself, and performance goes down.

Joan Marc joanmarcriera at gmail.com
Fri Aug 26 03:05:54 PDT 2011


Replying to myself to summarize a bit the bunch of logs I sent yesterday.

I'm trying to understand the following block of logs:

It seems quite strange to me to see such a large number of similar messages
(for example, 21018 refused connections) in such a small amount of time,
less than a second.

Is there any way to ask Lustre WHY it is refusing that connection?
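
In case it helps anyone answer, here is a minimal sketch of what I am
planning to try next to get more detail out of Lustre -- assuming the
lctl set_param/clear/dk interface behaves as I expect on this server:

  lctl set_param debug=+rpctrace   # add RPC tracing to the debug mask
  lctl clear                       # empty the in-kernel debug buffer
  # ... wait for the "refuse reconnection" messages to show up again ...
  lctl dk /tmp/lustre-debug.txt    # dump the debug buffer to a text file
  grep -iE 'refus|busy' /tmp/lustre-debug.txt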

many thanks.

Marc





Aug 22 18:58:13 s_kernel at n4 kernel: Lustre:
23954:0:(ldlm_lib.c:542:target_handle_reconnect()) scratch-OST0034:
1ccbe619-062d-09df-9e3b-3ff3960bfa6a reconnecting
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre:
23954:0:(ldlm_lib.c:542:target_handle_reconnect()) Skipped 21021 previous
similar messages
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre:
23954:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse
reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to
0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre:
23954:0:(ldlm_lib.c:799:target_handle_connect()) Skipped 21018 previous
similar messages
Aug 22 18:59:01 s_sys at n4 crond[21701]: (perf) CMD (/opt/perf/bin/
ganglia_metrics.pl -jobs -bond0   &>/dev/null)
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre:
24195:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse
reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to
0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre:
24195:0:(ldlm_lib.c:799:target_handle_connect()) Skipped
previous similar messages



Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb())
Watchdog triggered for pid 13380: it was inactive for 200.00s
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb())
Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre:
0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process
13380
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre:
0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) Skipped 4 previous similar
messages
Aug 22 18:59:53 s_kernel at n4 kernel: ll_ost_io_49  S ffff81000900c980     0
13380      1         13381 13379 (L-TLB)
Aug 22 18:59:53 s_kernel at n4 kernel: ffff81061fd55a10 0000000000000046
ffff8104ca43f400 0000000000000000
Aug 22 18:59:53 s_kernel at n4 kernel: 0000000000000000 000000000000000a
ffff8106201db040 ffff8106400677d0
Aug 22 18:59:53 s_kernel at n4 kernel: 00063866b1dac473 0000000000000b17
ffff8106201db228 0000000100000000
Aug 22 18:59:53 s_kernel at n4 kernel: Call Trace:
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888c5f27>]
:ptlrpc:target_queue_recovery_request+0xb37/0x12b0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>]
default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88b5d11b>]
:ost:ost_handle+0x37b/0x5cd0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>]
thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c5d06>]
free_pages_bulk+0x1ee/0x266
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c9cb4>]
zone_statistics+0x3e/0x6d
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8014a88b>]
__next_cpu+0x19/0x28
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800893ba>]
find_busiest_group+0x20d/0x621
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888f7325>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a228>]
enqueue_task+0x41/0x56
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888ffa6a>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8890190d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88903e9e>]
:ptlrpc:ptlrpc_server_handle_request+0xa9e/0x1160
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>]
thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80088c83>]
__wake_up_common+0x3e/0x68
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88907388>]
:ptlrpc:ptlrpc_main+0x1208/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>]
default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800b5eed>]
audit_syscall_exit+0x31b/0x336
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88906180>]
:ptlrpc:ptlrpc_main+0x0/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e013>] child_rip+0x0/0x11
Aug 22 18:59:53 s_kernel at n4 kernel:
Aug 22 18:59:53 s_kernel at n4 kernel: LustreError: dumping log to
/tmp/lustre-log.1314032393.13380
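
If it helps, this is roughly how I intend to look at that dump and at the
OSS state -- assuming the dumped file is the usual binary debug log that
lctl df can convert, and that the ost_io proc entries are where I think
they are on 1.8:

  # convert the binary debug log to text and read it
  lctl df /tmp/lustre-log.1314032393.13380 /tmp/lustre-log.13380.txt
  less /tmp/lustre-log.13380.txt
  # see how many ost_io service threads are running vs. the maximum
  lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max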

On 25 August 2011 18:37, Joan Marc <joanmarcriera at gmail.com> wrote:

> Hello,
>
> I'm seeing something that I cannot understand. I'm quite new to Lustre.
>
>
> n4 and n5 are two I/O servers, each one taking care of different datasets.
>
> When one of them goes down, the other is supposed to take over the service
> as the backup node. But n4 goes down without being under heavy load, and
> when n5 takes over the service, its CPU and memory usage go through the
> roof and performance drops to the floor.
>
> Can someone tell me which lines to look at, so that I can start
> investigating this issue from there?
>
> Here is a 30-minute syslog with Lustre-related messages from n4:
> http://pastebin.com/q1iGwDxw
>
> Here is almost the same 30 minutes from n5:
> http://pastebin.com/4Bg5repa
>
>
> Many thanks.
>
> Marc
>
>
>
>