[Lustre-discuss] Odd performance issue with 1.4.x OSS ...

Klaus Steden klaus.steden at thomson.net
Fri Nov 7 15:03:00 PST 2008


On 11/7/08 2:46 PM, "Andreas Dilger" <adilger at sun.com> etched on stone
tablets:

> On Nov 03, 2008  11:39 -0800, Steden Klaus wrote:
>> Hello Andreas,
>> 
>> Thanks for the info. Here's some more information (basically cut and pasted
>> from before and after a typical stack dump message):
>> 
>> -- cut --
>> Lustre: 0:0:(watchdog.c:121:lcw_cb()) Watchdog triggered for pid 20559: it
>> was inactive for 100000ms
>> Lustre: 0:0:(linux-debug.c:155:libcfs_debug_dumpstack()) showing stack for
>> process 20559
>> ll_ost_220    S 000001006e854008     0 20559      1         20560 20558
>> (L-TLB)
>> 000001009739f3e8 0000000000000046 000000000000000f ffffffffa05b33b8
>>        0000000000000548 0000000100000000 0000000000000000 000001006e8540b0
>>        0000000000000013 0000000000000000
>> Call Trace:<ffffffffa05b33b8>{:ptlrpc:ptl_send_buf+824}
>> <ffffffff801454bd>{__mod_timer+317}
>>        <ffffffff8033860d>{schedule_timeout+381}
>> <ffffffff801460a0>{process_timeout+0}
> 
> This is a watchdog timeout, not an oops or panic as you mentioned in your
> earlier comment.  It is debugging output from Lustre.
> 
>> LustreError: 20438:0:(service.c:648:ptlrpc_server_handle_request()) request
>> 4777659 opc 101 from 12345-10.0.0.249 at vib processed in 118s trans 0 rc 0/0
>> Lustre: 20438:0:(watchdog.c:302:lcw_update_time()) Expired watchdog for pid
>> 20438 disabled after 118.9719s
>> LustreError: 20257:0:(ldlm_lockd.c:579:ldlm_server_completion_ast()) ###
>> enqueue wait took 118984007us from 1225493396 ns: filter-ost13_UUID lock:
>> 0000010036031dc0/0x7bef85be5c78b145 lrc: 2/0,0 mode: PW/PW res: 15229751/0
>> rrc: 3 type: 
> 
> This looks like slow request processing on the OST.  I would suggest
> increasing the Lustre timeout value to 150s or more.  Timeout values of
> 300s are fairly common in larger installations (1000 nodes or more).  In
> the 1.8 release Lustre will automatically tune this value so that it
> can adapt to larger-scale systems better.
> 
Hi Andreas,

That is good to know. I thought it was an oops message, not normal
debugging. I'm used to Lustre "just working" without this kind of output, so
I assumed the worst when I saw it.

As for the issue itself, apparently the Voltaire switch needed a restart,
which has cleared whatever condition was causing the problem.
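
For anyone who finds this thread later, here is a minimal Python sketch of
how one might pull the durations out of messages like the ones quoted above
and compare them against a candidate timeout. The regular expressions are
modelled only on the two message shapes in this thread, so other Lustre
versions may word them differently. The names slow_events and threshold_s
are my own and are not part of Lustre.

```python
import re

# Patterns are assumptions based only on the log lines quoted in this thread.
WATCHDOG = re.compile(
    r"Watchdog triggered for pid (\d+): it was inactive for (\d+)ms"
)
# The list archive renders "@" as " at ", so the peer pattern accepts both.
REQUEST = re.compile(
    r"request (\d+) opc (\d+) from (\S+(?: at \S+)?) processed in (\d+)s"
)


def slow_events(log_text, threshold_s):
    """Return (kind, id, seconds) tuples for events lasting >= threshold_s.

    kind is "watchdog" (id is the pid) or "request" (id is the request number).
    """
    # Collapse whitespace so messages wrapped across lines still match.
    text = " ".join(log_text.split())
    events = []
    for m in WATCHDOG.finditer(text):
        events.append(("watchdog", int(m.group(1)), int(m.group(2)) / 1000.0))
    for m in REQUEST.finditer(text):
        events.append(("request", int(m.group(1)), float(m.group(4))))
    return [e for e in events if e[2] >= threshold_s]


# Two of the messages from this thread, without the mail quote prefixes.
SAMPLE = """\
Lustre: 0:0:(watchdog.c:121:lcw_cb()) Watchdog triggered for pid 20559: it
was inactive for 100000ms
LustreError: 20438:0:(service.c:648:ptlrpc_server_handle_request()) request
4777659 opc 101 from 12345-10.0.0.249 at vib processed in 118s trans 0 rc 0/0
"""

if __name__ == "__main__":
    for kind, ident, seconds in slow_events(SAMPLE, 100):
        print(kind, ident, seconds)
```

With a threshold of 100 both sample messages are reported. With 150, the
lower bound Andreas suggested, neither is, which at least gives a rough
sense of how much headroom a given timeout leaves.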

thanks,
Klaus 



