[Lustre-discuss] Slow access or hung state of Lustre filesystem on client machines

faheem patel pfaheem at gmail.com
Mon Sep 19 02:10:27 PDT 2011


Hi All,

Thanks in advance.
We are new to the Lustre filesystem.

We have installed a Lustre filesystem of about 30 TB, which is mounted on our
client systems.

We have 2 metadata servers (MDS) configured in HA mode with an IB interface,
and a single 500 GB MDT filesystem mounted (at /mdc).

We have 2 OSS servers with HA configured between them:
the 1st OSS server has 8 OSTs,
and the 2nd OSS server has 9 OSTs,
i.e. a total of 17 OSTs distributed between the 2 OSS servers, which are
configured in HA with bonded IB interfaces on both servers.

All Lustre clients, OSS servers, and MDS servers are on an InfiniBand (IB)
network.
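
For reference, the layout above can be confirmed from any client with the
standard Lustre command-line tools (a minimal sketch; /mnt/lustre stands in
for our actual mount point, which may differ):

    # list every OST and its usage as seen by this client (should show 17 OSTs)
    lfs df -h /mnt/lustre

    # list the local Lustre devices and their import states
    lctl dl

    # show the LNET NIDs configured on this node (should be o2ib NIDs)
    lctl list_nids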

We have been getting the following error messages on our OSS servers and on
our client machines for the past week.

----------------------------------------------------------------------------------------------------------------------------------------

 *Lustre OSS server error logs*

Sep 19 11:08:57 oss1 kernel: Lustre:
15007:0:(ldlm_lib.c:872:target_handle_connect()) lustre-OST0006:
refuse reconnection from
65820f02-c4f0-e79a-4778-15a9b4653a88@10.148.0.2@o2ib to
0xffff8806320e1800; still busy with 1 active RPCs
Sep 19 11:08:57 oss1 kernel: Lustre:
15007:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous
similar message
Sep 19 11:09:18 oss1 kernel: Lustre:
13143:0:(ldlm_lib.c:572:target_handle_reconnect()) lustre-OST0006:
65820f02-c4f0-e79a-4778-15a9b4653a88 reconnecting
Sep 19 11:09:18 oss1 kernel: Lustre:
13143:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 18 previous
similar messages
Sep 19 11:09:26 oss1 kernel: Lustre:
13255:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380348987462916 sent from lustre-OST0008 to NID 10.148.0.2@o2ib 7s
ago has timed out (7s prior to deadline).
Sep 19 11:09:26 oss1 kernel:   req@ffff880631af7400
x1380348987462916/t0 o104->@NET_0x500000a940002_UUID:15/16 lens
296/384 e 0 to 1 dl 1316410765 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError: 138-a: lustre-OST0008: A
client on nid 10.148.0.2@o2ib was evicted due to a lock blocking
callback to 10.148.0.2@o2ib timed out: rc -107
Sep 19 11:09:26 oss1 kernel: LustreError:
13122:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@ffff880312465c00 x1380348987462920/t0
o105->@NET_0x500000a940002_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1
fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError:
13122:0:(ldlm_lockd.c:595:ldlm_handle_ast_error()) ### client (nid
10.148.0.2@o2ib) returned 0 from completion AST ns:
filter-lustre-OST0008_UUID lock: ffff880629826c00/0x7f1137a31caded22
lrc: 3/0,0 mode: PW/PW res: 10165838/0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0
remote: 0xb4433ce500b30ffb expref: 440 pid: 13255 timeout 0
Sep 19 11:09:38 oss1 kernel: LustreError:
13117:0:(ldlm_lockd.c:1824:ldlm_cancel_handler()) operation 103 from
12345-10.148.0.2@o2ib with bad export cookie 9156160691120629456
Sep 19 11:26:58 oss1 gdm-session-worker[15599]: PAM pam_putenv: NULL
pam handle passed
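
The evictions above indicate the OSS could not complete a lock blocking
callback to the client at 10.148.0.2@o2ib. As a basic sanity check (only a
sketch of the standard LNET/IB checks, not output from our systems),
connectivity can be tested from oss1 with:

    # verify LNET reachability of the client NID named in the eviction messages
    lctl ping 10.148.0.2@o2ib

    # check the low-level InfiniBand port state on the OSS
    ibstat
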
----------------------------------------------------------------------------------------------------------------------------------------

 *Lustre client error log messages*

Sep 19 10:38:43 service0 kernel: [ 6094.583298]   req@ffff880b0f7f0800
x1380348451643117/t0 o8->lustre-OST0007_UUID@10.148.0.107@o2ib:28/4
lens 368/584 e 0 to 1 dl 1316408923 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:38:43 service0 kernel: [ 6094.583305] Lustre:
8565:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 6 previous
similar messages
Sep 19 10:38:44 service0 kernel: [ 6095.582711] Lustre:
8566:0:(import.c:517:import_select_connection())
lustre-OST0001-osc-ffff8806234ee000: tried all connections, increasing
latency to 8s
Sep 19 10:38:46 service0 kernel: [ 6097.355378] LustreError: 11-0: an
error occurred while communicating with 10.148.0.107@o2ib. The
ost_connect operation failed with -16
Sep 19 10:38:46 service0 kernel: [ 6097.355381] LustreError: Skipped
20 previous similar messages
Sep 19 10:38:58 service0 kernel: [ 6109.582174] Lustre:
lustre-OST0001-osc-ffff8806234ee000: Connection restored to service
lustre-OST0001 using nid 10.148.0.107@o2ib.
Sep 19 10:39:55 service0 kernel: [ 6166.617376] Lustre:
14902:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380348451645667 sent from lustre-OST0000-osc-ffff8806234ee000 to NID
10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 19 10:39:55 service0 kernel: [ 6166.617381]   req@ffff8805f69bc800
x1380348451645667/t0 o101->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4
lens 296/544 e 0 to 1 dl 1316408995 ref 2 fl Rpc:/0/0 rc 0/0
Sep 19 10:39:55 service0 kernel: [ 6166.617390] Lustre:
14902:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
similar messages
Sep 19 10:39:55 service0 kernel: [ 6166.617402] Lustre:
lustre-OST0000-osc-ffff8806234ee000: Connection to service
lustre-OST0000 via nid 10.148.0.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 19 10:39:55 service0 kernel: [ 6166.617406] Lustre: Skipped 8
previous similar messages
Sep 19 10:39:56 service0 sshd[14904]: Accepted
keyboard-interactive/pam for root from 192.9.70.32 port 33623 ssh2
Sep 19 10:40:08 service0 kernel: [ 6179.616393] Lustre:
8565:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380348451645687 sent from lustre-OST0000-osc-ffff8806234ee000 to NID
10.148.0.106@o2ib 13s ago has timed out (13s prior to deadline).
Sep 19 10:40:08 service0 kernel: [ 6179.616396]   req@ffff880af48e5000
x1380348451645687/t0 o8->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4
lens 368/584 e 0 to 1 dl 1316409008 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:09 service0 kernel: [ 6180.616338] Lustre:
8566:0:(import.c:517:import_select_connection())
lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing
latency to 9s
Sep 19 10:40:09 service0 kernel: [ 6180.616344] Lustre:
8566:0:(import.c:517:import_select_connection()) Skipped 8 previous
similar messages
Sep 19 10:40:16 service0 kernel: [ 6187.219814] Lustre:
8564:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380348451645670 sent from lustre-OST0000-osc-ffff8806234ee000 to NID
10.148.0.106@o2ib 31s ago has timed out (31s prior to deadline).
Sep 19 10:40:16 service0 kernel: [ 6187.219818]   req@ffff880b27740800
x1380348451645670/t0 o400->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4
lens 192/384 e 0 to 1 dl 1316409016 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:16 service0 kernel: [ 6187.219843] Lustre:
lustre-OST0003-osc-ffff8806234ee000: Connection to service
lustre-OST0003 via nid 10.148.0.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219448] Lustre:
lustre-OST0010-osc-ffff8806234ee000: Connection to service
lustre-OST0010 via nid 10.148.0.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219453] Lustre: Skipped 2
previous similar messages
Sep 19 10:40:24 service0 kernel: [ 6195.615180] Lustre:
8566:0:(import.c:517:import_select_connection())
lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing
latency to 10s
Sep 19 10:40:25 service0 kernel: [ 6196.029170] LustreError: 11-0: an
error occurred while communicating with 10.148.0.106@o2ib. The
ost_connect operation failed with -16
Sep 19 10:40:25 service0 kernel: [ 6196.029174] LustreError: Skipped 8
previous similar messages
Sep 19 10:40:25 service0 kernel: [ 6196.029345] Lustre:
lustre-OST0000-osc-ffff8806234ee000: Connection restored to service
lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 19 10:40:25 service0 kernel: [ 6196.029349] Lustre: Skipped 8
previous similar messages
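
For completeness, the same kind of connectivity check can be run from the
client side (again only a sketch; the NIDs are the ones that appear in the
logs above):

    # verify LNET reachability of both OSS NIDs from the client
    lctl ping 10.148.0.106@o2ib
    lctl ping 10.148.0.107@o2ib

    # inspect the base RPC timeout; the short 7s/14s/31s deadlines above are
    # set by Lustre's adaptive timeouts rather than this fixed value
    lctl get_param timeout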

Thanks and Regards,

Faheem Patel