[Lustre-discuss] strange slowdown

Aaron Knister aaron at iges.org
Thu Dec 13 12:12:49 PST 2007


Thanks for your help! I have some more information from the lctl dk output:

10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c: 
1130:mgc_process_log()) Can't get cfg lock: -108
10000000:01000000:1:1197576228.177727:0:8511:0:(mgc_request.c: 
558:mgc_blocking_ast()) Lock res 0x61746164 (data)
00000100:00020000:3:1197576228.177728:0:8816:0:(client.c: 
710:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff8103dba84c00  
x390/t0 o501->MGS at MGC192.168.64.70@o2ib_0:26/25 lens 200/304 e 0 to 11  
dl 0 ref 1 fl Rpc:/8/0 rc 0/0
10000000:01000000:1:1197576228.177729:0:8511:0:(mgc_request.c: 
583:mgc_blocking_ast()) log data-OST0000: original grant failed, will  
requeue later
10000000:01000000:3:1197576228.177731:0:8816:0:(mgc_request.c: 
1182:mgc_process_log()) MGC192.168.64.70 at o2ib: configuration from log  
'data-OST0000' failed (-108).
00000100:00080000:1:1197576236.900462:0:8444:0:(pinger.c: 
143:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or  
recovery disabled: 0/1)

This is on the OSS.
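
To make entries like these easier to read, here is a minimal Python sketch that splits one entry into fields and converts its timestamp. It assumes the usual colon-separated header layout of the debug log (subsystem mask, debug mask, CPU, seconds.microseconds, stack size, pid, host pid, then "(file:line:function())" and the message). The group names and the parse_entry helper are illustrative labels, not Lustre identifiers.

```python
import re
from datetime import datetime, timezone

# Assumed header layout of one debug-log entry; group names are illustrative.
ENTRY_RE = re.compile(
    r"^(?P<subsystem>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>\d+):"
    r"(?P<sec>\d+)\.(?P<usec>\d{6}):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:()]+):\s*(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

def parse_entry(text):
    """Parse one (possibly mail-wrapped) entry; return a dict or None."""
    # Undo mail wrapping: collapse all whitespace runs into single spaces.
    flat = " ".join(text.split())
    m = ENTRY_RE.match(flat)
    if m is None:
        return None
    entry = m.groupdict()
    # Build the timestamp from integer parts to avoid float rounding.
    when = datetime.fromtimestamp(int(entry.pop("sec")), tz=timezone.utc)
    entry["when"] = when.replace(microsecond=int(entry.pop("usec")))
    for key in ("cpu", "stack", "pid", "extpid", "line"):
        entry[key] = int(entry[key])
    return entry

# First entry from the OSS excerpt above, wrapped as it was in the mail.
sample = """10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c:
1130:mgc_process_log()) Can't get cfg lock: -108"""
entry = parse_entry(sample)
print(entry["when"].isoformat(), entry["func"], entry["msg"])
```

Converting the timestamps to UTC makes it easier to line up entries from the OSS and the client, provided their clocks are in sync.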

Also on the OSS --

00010000:00000400:2:1197576684.886679:0:8597:0:(ldlm_lib.c:515:target_handle_reconnect())
data-OST0005: 532a7ed7-8e93-e086-885a-b064e46adb12 reconnecting
00010000:00000400:2:1197576684.886683:0:8597:0:(ldlm_lib.c:744:target_handle_connect())
data-OST0005: refuse reconnection from
532a7ed7-8e93-e086-885a-b064e46adb12 at 192.168.64.102@o2ib to
0xffff8103cc9e3000; still busy with 9 active RPCs
00000100:00100000:1:1197576684.886683:0:8599:0:(service.c:1032:ptlrpc_server_handle_request())
Handling RPC pname:cluuid+ref:pid:xid:nid:opc
ll_ost_55:532a7ed7-8e93-e086-885a-b064e46adb12+6:3962:x868:12345-192.168.64.102 at o2ib:400
00000010:00000002:1:1197576684.886687:0:8599:0:(ost_handler.c:1598:ost_handle())
@@@ ping  req at ffff81042f7a3c00 x868/t0
o400->532a7ed7-8e93-e086-885a-b064e46adb12 at NET_0x50000c0a84066_UUID:0/0
lens 128/0 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:2:1197576684.886688:0:8597:0:(ldlm_lib.c:1458:target_send_reply_msg())
@@@ processing error (-16)  req at ffff8104167fe850 x871/t0
o8->532a7ed7-8e93-e086-885a-b064e46adb12 at NET_0x50000c0a84066_UUID:0/0
lens 304/200 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc -16/0

On the client it shows --

00000100:00080000:0:1197576416.143577:0:3964:0:(recover.c: 
54:ptlrpc_initiate_recovery()) data-OST0004_UUID: starting recovery
00000100:00080000:0:1197576416.143585:0:3964:0:(import.c: 
381:ptlrpc_connect_import()) ffff81082f49a000 data-OST0004_UUID:  
changing import state from DISCONN to CONNECTING
00000100:00080000:0:1197576416.143590:0:3964:0:(import.c: 
275:import_select_connection()) data-OST0004-osc-ffff81082ae12400:  
connect to NID 192.168.64.71 at o2ib last attempt 4296998987
00000100:00080000:0:1197576416.143597:0:3964:0:(import.c: 
339:import_select_connection()) data-OST0004-osc-ffff81082ae12400:  
import ffff81082f49a000 using connection 192.168.64.71 at o2ib/ 
192.168.64.71 at o2ib
00000100:02020000:0:1197576416.143864:0:3963:0:(client.c: 
581:ptlrpc_check_status()) 11-0: an error occurred while communicating  
with 192.168.64.71 at o2ib. The ost_connect operation failed with -16
00000100:00080000:0:1197576416.144314:0:3963:0:(import.c: 
759:ptlrpc_connect_interpret()) ffff81082f49a000 data-OST0004_UUID:  
changing import state from CONNECTING to DISCONN
00000100:00080000:0:1197576416.144316:0:3963:0:(import.c: 
801:ptlrpc_connect_interpret()) recovery of data-OST0004_UUID on  
192.168.64.71 at o2ib failed (-16)
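
The negative numbers in these entries are kernel errno values. Here is a small Python sketch for naming them. It uses a hand-copied table of Linux errno numbers covering only the two codes that appear in the logs above; decode_rc is a hypothetical helper, not part of any Lustre tool.

```python
import re

# Linux errno values, hand-copied; only the two codes seen in these logs.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
}

# A minus sign followed by digits, not glued to a preceding word character,
# so hyphens inside UUIDs or names like data-OST0000 are ignored.
RC_RE = re.compile(r"(?<![\w.])-(\d+)\b")

def decode_rc(message):
    """Return (code, name, description) for each negative code in message."""
    found = []
    for m in RC_RE.finditer(message):
        number = int(m.group(1))
        name, text = LINUX_ERRNO.get(number, ("unknown", "not in this table"))
        found.append((-number, name, text))
    return found

for line in (
    "The ost_connect operation failed with -16",
    "configuration from log 'data-OST0000' failed (-108).",
):
    print(line, "->", decode_rc(line))
```

Read this way, -16 is EBUSY, which matches the OSS message that it is "still busy with 9 active RPCs", and -108 is ESHUTDOWN.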

I'm at a loss.

On Dec 13, 2007, at 11:59 AM, Oleg Drokin wrote:

> Hello!
>
> On Dec 13, 2007, at 11:48 AM, Aaron Knister wrote:
>
>> On the client I see this --
>
> This shows no activity aside from the fact that client is  
> disconnected from OST5.
>
>> and on the server --
>
> This one shows that the server does not allow the client to reconnect
> because it is still
> busy processing other requests from this client. That's the reason
> for the "mount hang".
>
> This is all I can tell from the logs you provided. If the logs
> actually go further back in time, there might be more useful info
> in them.
> Since there was a disconnection, perhaps dmesg on the client and server
> contains more info about the reasons for it. Also, if you do
> sysrq-t on the server, you will see what is going on with those server
> threads that are supposedly still processing client requests.
>
> Bye,
>    Oleg

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org





