[Lustre-discuss] "unexpectedly long timeout"
Yuriy Umanets
Yury.Umanets@Sun.COM
Sun Nov 9 01:41:36 PST 2008
Yuriy Umanets wrote:
> Brock Palen wrote:
>
>> A new error I have never seen before; googling didn't find much
>> other than an error involving IB. This node has IB, but Lustre runs
>> over TCP.
>>
>> Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600 req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0
>> Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 previous similar message
>> Nov 5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600 req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0
>>
>> On the OSS that provides OST000c, the only errors I see from that
>> node are the usual 'can't hear from node' messages:
>>
>> Nov 4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c:1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e@NET_0x200000aa4029c_UUID id 12345-10.164.2.156@tcp - client will retry
>> Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>> Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>>
>> Any thoughts?
>>
>>
>>
> Brock,
>
> This usually indicates network issues. Basically, it happens when
> lnet does not receive the bulk unlink callback for a long time due
> to a sluggish network, and Lustre has to wait in a loop for an
> abnormally long time (300s) to make sure the bulk is unregistered.
> It really can't do this any other way: an rpc waiting for bulk
> unregistering can't be freed, returned to the rq pool, etc., that
> is, it can't be reused for any other purpose until this is done,
> because its buffers are still mapped by lnet and network data may
> arrive at any moment. It would be a disaster if that happened after
> the rpc had already been freed.
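>
> To make the invariant concrete, here is a minimal sketch (mine, not
> the actual niobuf.c code; all names in it are illustrative) of why
> the buffer can only be freed after the network layer confirms the
> unlink:
>
>     /* bulk_wait.c - build with: cc bulk_wait.c -o bulk_wait -lpthread */
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <unistd.h>
>
>     struct bulk_desc {
>             char            *buffer;    /* mapped for network I/O */
>             int              unlinked;  /* set by the "network" callback */
>             pthread_mutex_t  lock;
>     };
>
>     /* Stand-in for the lnet unlink callback arriving late. */
>     static void *slow_network(void *arg)
>     {
>             struct bulk_desc *desc = arg;
>
>             sleep(3);                   /* sluggish network */
>             pthread_mutex_lock(&desc->lock);
>             desc->unlinked = 1;         /* buffer no longer mapped */
>             pthread_mutex_unlock(&desc->lock);
>             return NULL;
>     }
>
>     int main(void)
>     {
>             struct bulk_desc desc = { malloc(4096), 0,
>                                       PTHREAD_MUTEX_INITIALIZER };
>             pthread_t net;
>             int done;
>
>             pthread_create(&net, NULL, slow_network, &desc);
>
>             /* Poll until the unlink is confirmed; freeing desc.buffer
>              * earlier would let the network write into freed memory.
>              * This is the wait that can take up to 300s in Lustre. */
>             for (;;) {
>                     pthread_mutex_lock(&desc.lock);
>                     done = desc.unlinked;
>                     pthread_mutex_unlock(&desc.lock);
>                     if (done)
>                             break;
>                     fprintf(stderr, "still waiting for bulk unlink...\n");
>                     sleep(1);
>             }
>             pthread_join(net, NULL);
>             free(desc.buffer);          /* safe only now */
>             return 0;
>     }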
>
> The bad thing here is that such a blocked rpc also blocks the other
> rpcs in its set from being processed. While I can't suggest how to
> fix this, since it is not directly a Lustre issue but rather a
> network one, I can assure you that we're working on it.
>
> While we can't prevent such a long bulk unregister from happening
> again, we can make Lustre not block on it and instead let the other
> rpcs proceed. The delay would then only be visible if such a long
> unlink happened around umount time, which would make you wait longer
> than usual, with all sorts of messages saying that some rpcs are
> still on the sending list.
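>
> As a rough illustration (my assumption of how such a fix could look,
> not the actual patch): instead of blocking on the stuck rpc, it
> would be parked on a side list so the rest of the set keeps making
> progress, and only teardown waits for the stragglers:
>
>     #include <stdio.h>
>
>     enum rpc_state { RPC_DONE, RPC_UNLINKING };
>
>     struct rpc { int id; enum rpc_state state; };
>
>     int main(void)
>     {
>             struct rpc set[] = { {1, RPC_DONE}, {2, RPC_UNLINKING},
>                                  {3, RPC_DONE} };
>             struct rpc *delayed[3];
>             int i, n_delayed = 0;
>
>             for (i = 0; i < 3; i++) {
>                     if (set[i].state == RPC_UNLINKING) {
>                             /* Park it instead of waiting up to 300s;
>                              * the rest of the set keeps going. */
>                             delayed[n_delayed++] = &set[i];
>                             continue;
>                     }
>                     printf("rpc %d completed\n", set[i].id);
>             }
>             /* Only at teardown (e.g. umount) do we wait for the
>              * stragglers; this is when the "still on sending list"
>              * messages would show up. */
>             printf("%d rpc(s) still waiting for bulk unlink\n",
>                    n_delayed);
>             return 0;
>     }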
>
> Please look at bug 17310, where a similar issue with
> ptlrpc_unregister_reply() was discussed and patches working around
> that issue were provided.
>
> Thanks.
>
Brock,
I have actually filed a bug for this. Please refer to
https://bugzilla.lustre.org/show_bug.cgi?id=17631. I hope to work this
out shortly.
Thanks
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp@umich.edu
>> (734)936-1985
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>
>
>
--
umka