[Lustre-discuss] "unexpectedly long timeout"

Yuriy Umanets Yury.Umanets at Sun.COM
Sun Nov 9 01:41:36 PST 2008


Yuriy Umanets wrote:
> Brock Palen wrote:
>   
>> New error I have never seen before; googling didn't find much other  
>> than an error involving IB. This node has IB, but Lustre runs over TCP.
>>
>> Nov  5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600  req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0
>> Nov  5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 previous similar message
>> Nov  5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600  req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0
>>
>> On the OSS that provides OST000c, the only errors I see from that  
>> node are the usual 'can't hear from node' messages:
>>
>> Nov  4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c:1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e@NET_0x200000aa4029c_UUID id 12345-10.164.2.156@tcp - client will retry
>> Nov  4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>> Nov  4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>>
>> Any thoughts?
>>
> Brock,
>
> This usually means network issues. Basically, it happens when LNET 
> cannot deliver the bulk unlink callback for a long time due to a 
> sluggish network, and Lustre has to wait in a loop for an abnormally 
> long time (300s) to make sure the bulk is unregistered. It really 
> cannot be done any other way: an RPC waiting for bulk unregistration 
> cannot be freed, returned to the request pool, or otherwise reused 
> before this completes, because its buffers are still mapped by LNET 
> and network data may arrive at any moment. It would be a disaster if 
> that happened after the RPC had already been freed.
>
> The bad thing here is that such a blocked RPC also blocks the other 
> RPCs in its set from being processed. While I can't suggest how to fix 
> the long unlink itself, because that is not directly a Lustre issue but 
> rather a network one, I can assure you that we are working on the 
> blocking behaviour.
>
> While we can't prevent such a long bulk unregister from happening 
> again, we can make Lustre not block on it and let other RPCs proceed. 
> Then it would only be visible if such a long unlink happened around 
> umount time, which would make umount take longer than usual, with 
> various messages about RPCs still being on the sending list.
>
> Please look at bug 17310, where a similar issue with 
> ptlrpc_unregister_reply() was discussed and patches for it were 
> provided.
>
> Thanks.
>   

Brock,

I have filed a bug for this issue. Please refer to 
https://bugzilla.lustre.org/show_bug.cgi?id=17631. I hope to work this 
out shortly.

Thanks
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 
umka



