[Lustre-discuss] Odd performance issue with 1.4.x OSS ...

Andreas Dilger adilger at sun.com
Fri Nov 7 14:46:56 PST 2008


On Nov 03, 2008  11:39 -0800, Steden Klaus wrote:
> Hello Andreas,
> 
> Thanks for the info. Here's some more information (basically cut and pasted from before and after a typical stack dump message):
> 
> -- cut --
> Lustre: 0:0:(watchdog.c:121:lcw_cb()) Watchdog triggered for pid 20559: it was inactive for 100000ms
> Lustre: 0:0:(linux-debug.c:155:libcfs_debug_dumpstack()) showing stack for process 20559
> ll_ost_220    S 000001006e854008     0 20559      1         20560 20558 (L-TLB)
> 000001009739f3e8 0000000000000046 000000000000000f ffffffffa05b33b8 
>        0000000000000548 0000000100000000 0000000000000000 000001006e8540b0 
>        0000000000000013 0000000000000000 
> Call Trace:<ffffffffa05b33b8>{:ptlrpc:ptl_send_buf+824} <ffffffff801454bd>{__mod_timer+317} 
>        <ffffffff8033860d>{schedule_timeout+381} <ffffffff801460a0>{process_timeout+0} 

This is a watchdog timeout, not an oops or panic as you suggested in your
earlier message.  It is Lustre's own debugging output: a service thread was
inactive for longer than the watchdog interval, so its stack was dumped.

> LustreError: 20438:0:(service.c:648:ptlrpc_server_handle_request()) request 4777659 opc 101 from 12345-10.0.0.249 at vib processed in 118s trans 0 rc 0/0
> Lustre: 20438:0:(watchdog.c:302:lcw_update_time()) Expired watchdog for pid 20438 disabled after 118.9719s
> LustreError: 20257:0:(ldlm_lockd.c:579:ldlm_server_completion_ast()) ### enqueue wait took 118984007us from 1225493396 ns: filter-ost13_UUID lock: 0000010036031dc0/0x7bef85be5c78b145 lrc: 2/0,0 mode: PW/PW res: 15229751/0 rrc: 3 type: 

This looks like slow request processing on the OST.  I would suggest
increasing the Lustre timeout value to 150s or more.  A timeout value of
300s is fairly common in larger installations (1000 nodes or more).  In
the 1.8 release Lustre will tune this value automatically so that it
adapts better to larger-scale systems.
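For reference, on 1.4/1.6 systems the timeout is exposed through /proc on
every server and client.  A minimal sketch of raising it to 300s follows;
the exact /proc path and the `lctl conf_param` syntax are from memory of
those releases, so verify them against your installation before relying
on this:

```shell
# Check the current Lustre RPC timeout (the default is 100s)
cat /proc/sys/lustre/timeout

# Raise it to 300s; this must be done on all servers and clients,
# and it does not persist across reboots, so add it to an init
# script as well.
echo 300 > /proc/sys/lustre/timeout

# On 1.6+, the value can instead be set persistently from the MGS
# (replace <fsname> with your filesystem name):
#   lctl conf_param <fsname>.sys.timeout=300
```

Note that the value must match across the whole cluster; a client with a
shorter timeout than the servers will still evict connections early.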

> I'm working on urging the clients to upgrade to 1.6 but it's slow going.
> 
> Any insight would be very helpful ... I can probably get on top of the fix if I have an inkling of where to start looking ... at this point, I don't, I just get reports that it's "slow".
> 
> thanks,
> Klaus
> 
> PS Sorry for the massive paste to everyone else on the list.
> 
> -----Original Message-----
> From: Andreas.Dilger at sun.com on behalf of Andreas Dilger
> Sent: Mon 11/3/2008 9:47 AM
> To: Steden Klaus
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Odd performance issue with 1.4.x OSS ...
>  
> On Oct 31, 2008  13:02 -0700, Steden Klaus wrote:
> > Our Lustre started exhibiting some curious performance issues today
> > ... basically, it slowed down dramatically and reliable I/O performance
> > became impossible. I looked through the output of dmesg and saw a number
> > of kernel 'oops' messages, but not being a Lustre kernel expert, I'm
> > not exactly sure what they indicate. I stopped the OSTs on the node in
> > question and ran e2fsck on the OST drives, but they've come up clean so
> > I don't think it's a hardware problem. I don't have physical access to
> > the machine right now so it may in fact be something on the back end,
> > but I'm working on verifying that with a technician on site. In the
> > meantime ... can anyone help decipher this for me? There are a couple
> > of messages like it:
> 
> These kinds of messages are of relatively little use unless they include
> some of the preceding lines.  Are you sure this is an oops, and not a
> watchdog timeout that is dumping the stack?
> 
> > -- cut --
> > ll_ost_215    S 00000100d2141808     0  8584      1          8585  8583 (L-TLB)
> > 00000101184233e8 0000000000000046 000000000000000f ffffffffa059c3b8 
> >        00000000005c2616 0000000100000000 0000000000000000 00000100d21418b0 
> >        0000000000000013 0000000000000000 
> > Call Trace:<ffffffffa059c3b8>{:ptlrpc:ptl_send_buf+824} <ffffffff801454bd>{__mod_timer+317} 
> >        <ffffffff8033860d>{schedule_timeout+381} <ffffffff801460a0>{process_timeout+0} 
> >        <ffffffffa0596e84>{:ptlrpc:ptlrpc_queue_wait+6932} 
> 
> This looks like a network problem, but it is hard to say without more
> information.  If you are a supported customer, you will get better service
> by filing a bugzilla bug.  This list only gets "as available" replies, and
> that is doubly true for old 1.4 Lustre installations.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



