[Lustre-discuss] strange slowdown

Aaron Knister aaron at iges.org
Thu Dec 13 15:51:08 PST 2007


Don't ask me how, but it resolved itself out of the blue. I have no idea
what went wrong...

On Dec 13, 2007, at 3:12 PM, Aaron Knister wrote:

> Thanks for your help! I have some more information from the lctl dk--
>
> 10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c:
> 1130:mgc_process_log()) Can't get cfg lock: -108
> 10000000:01000000:1:1197576228.177727:0:8511:0:(mgc_request.c:
> 558:mgc_blocking_ast()) Lock res 0x61746164 (data)
> 00000100:00020000:3:1197576228.177728:0:8816:0:(client.c:
> 710:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff8103dba84c00
> x390/t0 o501->MGS at MGC192.168.64.70@o2ib_0:26/25 lens 200/304 e 0 to 11
> dl 0 ref 1 fl Rpc:/8/0 rc 0/0
> 10000000:01000000:1:1197576228.177729:0:8511:0:(mgc_request.c:
> 583:mgc_blocking_ast()) log data-OST0000: original grant failed, will
> requeue later
> 10000000:01000000:3:1197576228.177731:0:8816:0:(mgc_request.c:
> 1182:mgc_process_log()) MGC192.168.64.70 at o2ib: configuration from log
> 'data-OST0000' failed (-108).
> 00000100:00080000:1:1197576236.900462:0:8444:0:(pinger.c:
> 143:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or
> recovery disabled: 0/1)
>
> This is on the OSS.
>
> Also on the OSS --
>
> 00010000:00000400:2:1197576684.886679:0:8597:0:(ldlm_lib.c:
> 515:target_handle_reconnect()) data-OST0005:
> 532a7ed7-8e93-e086-885a-b064e46adb12 reconnecting
> 00010000:00000400:2:1197576684.886683:0:8597:0:(ldlm_lib.c:
> 744:target_handle_connect()) data-OST0005: refuse reconnection from
> 532a7ed7-8e93-e086-885a-b064e46adb12 at 192.168.64.102@o2ib to
> 0xffff8103cc9e3000; still busy with 9 active RPCs
> 00000100:00100000:1:1197576684.886683:0:8599:0:(service.c:
> 1032:ptlrpc_server_handle_request()) Handling RPC
> pname:cluuid+ref:pid:xid:nid:opc
> ll_ost_55:532a7ed7-8e93-e086-885a-b064e46adb12+6:3962:x868:12345-192.168.64.102 at o2ib:400
> 00000010:00000002:1:1197576684.886687:0:8599:0:(ost_handler.c:
> 1598:ost_handle()) @@@ ping  req at ffff81042f7a3c00 x868/t0
> o400->532a7ed7-8e93-e086-885a-b064e46adb12 at NET_0x50000c0a84066_UUID:0/0
> lens 128/0 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc 0/0
> 00010000:00020000:2:1197576684.886688:0:8597:0:(ldlm_lib.c:
> 1458:target_send_reply_msg()) @@@ processing error (-16)
> req at ffff8104167fe850 x871/t0
> o8->532a7ed7-8e93-e086-885a-b064e46adb12 at NET_0x50000c0a84066_UUID:0/0
> lens 304/200 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc -16/0
>
> On the client it shows --
>
> 00000100:00080000:0:1197576416.143577:0:3964:0:(recover.c:
> 54:ptlrpc_initiate_recovery()) data-OST0004_UUID: starting recovery
> 00000100:00080000:0:1197576416.143585:0:3964:0:(import.c:
> 381:ptlrpc_connect_import()) ffff81082f49a000 data-OST0004_UUID:
> changing import state from DISCONN to CONNECTING
> 00000100:00080000:0:1197576416.143590:0:3964:0:(import.c:
> 275:import_select_connection()) data-OST0004-osc-ffff81082ae12400:
> connect to NID 192.168.64.71 at o2ib last attempt 4296998987
> 00000100:00080000:0:1197576416.143597:0:3964:0:(import.c:
> 339:import_select_connection()) data-OST0004-osc-ffff81082ae12400:
> import ffff81082f49a000 using connection 192.168.64.71 at o2ib/
> 192.168.64.71 at o2ib
> 00000100:02020000:0:1197576416.143864:0:3963:0:(client.c:
> 581:ptlrpc_check_status()) 11-0: an error occurred while communicating
> with 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> 00000100:00080000:0:1197576416.144314:0:3963:0:(import.c:
> 759:ptlrpc_connect_interpret()) ffff81082f49a000 data-OST0004_UUID:
> changing import state from CONNECTING to DISCONN
> 00000100:00080000:0:1197576416.144316:0:3963:0:(import.c:
> 801:ptlrpc_connect_interpret()) recovery of data-OST0004_UUID on
> 192.168.64.71 at o2ib failed (-16)
>
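The lctl dk excerpts above are hard-wrapped by the mail client, which makes them awkward to read or grep. Below is a minimal Python sketch that rejoins one quoted record and splits it into fields. The field order (subsystem, debug mask, CPU, seconds.microseconds, stack size, pid, extern pid, then file:line:function) is an assumption based on how these lines look, and `unwrap`, `parse_record` and `DebugRecord` are hypothetical names, not part of any Lustre tool.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumed layout of one record (the field names are illustrative labels):
# subsystem:mask:cpu:sec.usec:stack:pid:extern_pid:(file:line:func()) message
RECORD = re.compile(
    r"(?P<subsystem>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>\d+):"
    r"(?P<sec>\d+)\.(?P<usec>\d{6}):(?P<stack>\d+):(?P<pid>\d+):"
    r"(?P<extern_pid>\d+):"
    r"\((?P<file>[^:]+):\s*(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)",
    re.DOTALL,
)


class DebugRecord(NamedTuple):
    subsystem: int   # hex subsystem mask
    mask: int        # hex debug mask
    cpu: int
    when: datetime   # timestamp interpreted as seconds since the epoch, UTC
    pid: int
    file: str
    line: int
    func: str
    msg: str


def unwrap(quoted: str) -> str:
    """Drop leading '>' quote markers and rejoin wrapped lines with spaces.

    Limitation: a token that was wrapped mid-word (e.g. a UUID) gets a
    space at the break, because the original wrap point is not recoverable.
    """
    parts = (re.sub(r"^(?:>\s?)+", "", ln).strip() for ln in quoted.splitlines())
    return " ".join(p for p in parts if p)


def parse_record(text: str) -> Optional[DebugRecord]:
    """Parse one unwrapped record; return None if it does not match."""
    m = RECORD.match(text)
    if m is None:
        return None
    when = datetime.fromtimestamp(int(m["sec"]), tz=timezone.utc)
    when = when.replace(microsecond=int(m["usec"]))
    return DebugRecord(
        int(m["subsystem"], 16), int(m["mask"], 16), int(m["cpu"]), when,
        int(m["pid"]), m["file"], int(m["line"]), m["func"], m["msg"].strip(),
    )


# The first OSS record quoted in this thread, as it appears in the mail.
sample = """> 10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c:
> 1130:mgc_process_log()) Can't get cfg lock: -108"""
rec = parse_record(unwrap(sample))
print(rec.when.isoformat(), rec.func, rec.msg)
```

Converting the timestamps this way makes it easier to line up OSS and client records with dmesg output from the same window.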
> I'm at a loss.
>
> On Dec 13, 2007, at 11:59 AM, Oleg Drokin wrote:
>
>> Hello!
>>
>> On Dec 13, 2007, at 11:48 AM, Aaron Knister wrote:
>>
>>> On the client I see this --
>>
>> This shows no activity aside from the fact that the client is
>> disconnected from OST5.
>>
>>> and on the server --
>>
>> This one shows that the server does not allow the client to reconnect
>> because it is still busy processing other requests from this client.
>> That's the reason for the "mount hang".
>>
>> This is all I can tell from the logs you provided. If the logs
>> actually reach further back in time, there might be more useful info
>> in them. Since there was a disconnection, perhaps dmesg on the client
>> and server contains more info about the reason for it. Also, if you do
>> sysrq-t on the server, you will see what is going on with those server
>> threads that are supposedly still processing client requests.
>>
>> Bye,
>>   Oleg
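The negative numbers in the quoted logs (-16 on the refused reconnect, -108 on the MGC config lock) look like negated Linux errno values, which would fit the explanation above: 16 is EBUSY and 108 is ESHUTDOWN on Linux. Here is a small sketch, assuming that interpretation, for pulling such codes out of a log message. The table is hard-coded with only the two values seen in this thread, and `explain` is a hypothetical helper name.

```python
import re

# Linux errno values seen in this thread. Hard-coded because errno numbers
# differ between operating systems; extend the table as needed.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
}

# A minus sign followed by digits, not preceded by a word character or a dot,
# so the hyphens inside UUIDs and NIDs are not mistaken for return codes.
NEG_CODE = re.compile(r"(?<![\w.])-(\d+)\b")


def explain(msg):
    """Return a list of (code, name, description) for codes found in msg."""
    found = []
    for m in NEG_CODE.finditer(msg):
        code = int(m.group(1))
        name, text = LINUX_ERRNO.get(code, ("unknown", "not in the table"))
        found.append((code, name, text))
    return found


for line in (
    "The ost_connect operation failed with -16",
    "configuration from log 'data-OST0000' failed (-108).",
):
    print(line, "->", explain(line))
```

Read this way, the client's ost_connect failure with -16 is the other side of the server's "refuse reconnection ... still busy" message.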
>
> Aaron Knister
> Associate Systems Administrator/Web Designer
> Center for Research on Environment and Water
>
> (301) 595-7001
> aaron at iges.org

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org





