[Lustre-discuss] Client disconnects from OSS

Dejan Lesjak dejan.lesjak at ijs.si
Wed Oct 28 03:50:34 PDT 2009


We seem to have a bit of a problem with our Lustre test setup. During
high load - about 250MB/s writes from one client to 3 OSS, the client
starts loosing connection to OSTs (sometimes to MDS) after about an hour
and the writes hang for about 10 minutes:

One of the servers:
Oct 27 22:41:41 f9sn005 kernel: Lustre:
4067:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre-OST0003:
98ef67d6-49f3-2447-a52a-9cf3f49130cd reconnecting
Oct 27 22:41:41 f9sn005 kernel: Lustre:
4067:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous
similar messages
Oct 27 22:41:41 f9sn005 kernel: Lustre:
4184:0:(ldlm_lib.c:835:target_handle_connect()) lustre-OST0000: refuse
reconnection from 98ef67d6-49f3-2447-a52a-9cf3f49130cd at 194.2
49.156.61 at tcp to 0xffff81023154b200; still busy with 2 active RPCs
Oct 27 22:41:41 f9sn005 kernel: LustreError:
4184:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error
(-16)  req at ffff81020ab50c50 x1317422623456031/t0 o8->98e
f67d6-49f3-2447-a52a-9cf3f49130cd at NET_0x20000c2f99c3d_UUID:0/0 lens
368/264 e 0 to 0 dl 1256680301 ref 1 fl Interpret:/0/0 rc -16/0
Oct 27 22:41:41 f9sn005 kernel: LustreError:
4184:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 1 previous
similar message
Oct 27 22:43:20 f9sn005 kernel: Lustre:
4083:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre-OST0000:
98ef67d6-49f3-2447-a52a-9cf3f49130cd reconnecting


Client:
Oct 27 22:41:40 pikolit Lustre:
399:0:(client.c:1430:ptlrpc_expire_one_request()) @@@ Request
x1317422623456004 sent from lustre-OST0003-osc-ffff88073512ec00 to NID
194.24
9.156.130 at tcp 7s ago has timed out (7s prior to deadline).
Oct 27 22:41:40 pikolit req at ffff8805ba4a0c00 x1317422623456004/t0
o103->lustre-OST0003_UUID at 194.249.156.130@tcp:17/18 lens 312/384 e 0 to
1 dl 1256679700 ref 2 fl Rpc:N/0/
0 rc 0/0
Oct 27 22:41:40 pikolit Lustre:
399:0:(client.c:1430:ptlrpc_expire_one_request()) Skipped 12 previous
similar messages
Oct 27 22:41:40 pikolit Lustre: lustre-OST0003-osc-ffff88073512ec00:
Connection to service lustre-OST0003 via nid 194.249.156.130 at tcp was
lost; in progress operations usin
g this service will wait for recovery to complete.
Oct 27 22:41:40 pikolit Lustre: Skipped 8 previous similar messages
Oct 27 22:41:40 pikolit LustreError:
399:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -11 from cancel
RPC: canceling anyway
Oct 27 22:41:40 pikolit LustreError:
399:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 7 previous
similar messages
Oct 27 22:41:40 pikolit LustreError:
399:0:(ldlm_request.c:1586:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Oct 27 22:41:40 pikolit LustreError:
399:0:(ldlm_request.c:1586:ldlm_cli_cancel_list()) Skipped 7 previous
similar messages
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623455903 offset 192
length 41
6: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623453859 offset 192
length 41
6: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454918 offset 192
length 41
6: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454640 offset 192
length 41
6: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454554 offset 192
length 41
6: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454424 offset 192
length 416: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454296 offset 192
length 416: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623454159 offset 192
length 416: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 4 match 1317422623453984 offset 192
length 416: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from
12345-194.249.156.130 at tcp portal 18 match 1317422623456004 offset 192
length 192: 2
Oct 27 22:41:41 pikolit Lustre:
28152:0:(lib-move.c:1834:lnet_parse_get()) Dropping GET from
12345-194.249.156.130 at tcp portal 8 match 1317422623456018 offset 0
length 917504
Oct 27 22:41:41 pikolit Lustre: lustre-OST0003-osc-ffff88073512ec00:
Connection restored to service lustre-OST0003 using nid 194.249.156.130 at tcp.


There are no disk or filesystem related errors, nor are there problems
with the network.
The Lustre consists of 3 SLC5 (RHEL5 compatible distribution) servers,
all of them having OSTs and one of them MDT as well. They were recently
upgraded to 1.8.1.1, the problem was however present before (with
1.8.0). The clients (the writing one included) have 2.6.30 kernel with
modules for unpatched kernel last compiled from b1_8 branch a couple
days ago.
Is this a known problem with high load?

Cheers,
Dejan



More information about the lustre-discuss mailing list