[Lustre-discuss] Odd performance issue with 1.4.x OSS ...

Steden Klaus Klaus.Steden@thomson.net
Mon Nov 3 11:39:56 PST 2008


Hello Andreas,

Thanks for the info. Here's some more context, cut and pasted from before and after a typical stack dump message:

-- cut --
LustreError: 20438:0:(service.c:648:ptlrpc_server_handle_request()) request 4777659 opc 101 from 12345-10.0.0.249@vib processed in 118s trans 0 rc 0/0
Lustre: 20438:0:(watchdog.c:302:lcw_update_time()) Expired watchdog for pid 20438 disabled after 118.9719s
LustreError: 20257:0:(ldlm_lockd.c:579:ldlm_server_completion_ast()) ### enqueue wait took 118984007us from 1225493396 ns: filter-ost13_UUID lock: 0000010036031dc0/0x7bef85be5c78b145 lrc: 2/0,0 mode: PW/PW res: 15229751/0 rrc: 3 type: EXT [0->18446744073709551615] (req 8482816->8503295) flags: 20 remote: 0x84d9c08a99ff4b23 expref: 113 pid: 20438
LustreError: 20500:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1225493514, 20s ago) req@00000100cb900e00 x1151/t0 o104->@NET_0xb00000a0000fa_UUID:15 lens 176/64 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: A client on nid 10.0.0.250@vib was evicted from service ost5.
LustreError: 20500:0:(ldlm_lockd.c:423:ldlm_failed_ast()) ### blocking AST failed (-110): evicting client 36604_tiger_lov_7a82cb6234@NET_0xb00000a0000fa_UUID NID 10.0.0.250@vib (10.0.0.250@vib) ns: filter-ost5_UUID lock: 000001008f840c40/0x7bef85be5c78c80c lrc: 2/0,0 mode: PW/PW res: 15158480/0 rrc: 57 type: EXT [4651483136->4802478079] (req 4651483136->4654473215) flags: 20 remote: 0x789bd2e5e1ada437 expref: 120 pid: 20579
Lustre: 20107:0:(lib-move.c:1624:lnet_parse_put()) Dropping PUT from 12345-10.0.0.250@vib portal 16 match 1151 offset 0 length 64: 2
LustreError: 20269:0:(ldlm_lockd.c:1417:ldlm_cancel_handler()) operation 103 from 12345-10.0.0.250@vib with bad export cookie 8930503638740276498
LustreError: 20580:0:(ldlm_lib.c:1272:target_send_reply_msg()) @@@ processing error (-107) req@0000010060cc0000 x7055836/t0 o101-><?>@<?>:-1 lens 176/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client 7cb21_tiger_lov_7afb88fa94@NET_0xb00000a0000f9_UUID nid 10.0.0.249@vib  ns: filter-ost13_UUID lock: 0000010036031dc0/0x7bef85be5c78b145 lrc: 1/0,0 mode: PW/PW res: 15229751/0 rrc: 2 type: EXT [0->18446744073709551615] (req 8482816->8503295) flags: 20 remote: 0x84d9c08a99ff4b23 expref: 117 pid: 20438
LustreError: 20452:0:(ldlm_lib.c:1272:target_send_reply_msg()) @@@ processing error (-107) req@0000010064e22200 x4778674/t0 o101-><?>@<?>:-1 lens 176/0 ref 0 fl Interpret:/0/0 rc -107/0
Lustre: 0:0:(watchdog.c:121:lcw_cb()) Watchdog triggered for pid 20559: it was inactive for 100000ms
Lustre: 0:0:(linux-debug.c:155:libcfs_debug_dumpstack()) showing stack for process 20559
ll_ost_220    S 000001006e854008     0 20559      1         20560 20558 (L-TLB)
000001009739f3e8 0000000000000046 000000000000000f ffffffffa05b33b8 
       0000000000000548 0000000100000000 0000000000000000 000001006e8540b0 
       0000000000000013 0000000000000000 
Call Trace:<ffffffffa05b33b8>{:ptlrpc:ptl_send_buf+824} <ffffffff801454bd>{__mod_timer+317} 
       <ffffffff8033860d>{schedule_timeout+381} <ffffffff801460a0>{process_timeout+0} 
       <ffffffffa05ade84>{:ptlrpc:ptlrpc_queue_wait+6932} 
       <ffffffffa055927d>{:ptlrpc:l_has_lock+77} <ffffffffa058676c>{:ptlrpc:ldlm_add_waiting_lock+2156} 
       <ffffffff80137be0>{default_wake_function+0} <ffffffffa05a9620>{:ptlrpc:expired_request+0} 
       <ffffffffa05a96e0>{:ptlrpc:interrupted_request+0} <ffffffffa058bae8>{:ptlrpc:ldlm_server_blocking_ast+4072} 
       <ffffffffa055ef5a>{:ptlrpc:ldlm_run_ast_work+234} <ffffffffa05591f8>{:ptlrpc:l_unlock+248} 
       <ffffffffa0571c73>{:ptlrpc:ldlm_process_extent_lock+995} 
       <ffffffffa0562b4f>{:ptlrpc:ldlm_lock_enqueue+1087} 
       <ffffffffa058bbb0>{:ptlrpc:ldlm_server_completion_ast+0} 
       <ffffffffa058cb50>{:ptlrpc:ldlm_server_glimpse_ast+0} 
       <ffffffffa058f694>{:ptlrpc:ldlm_handle_enqueue+8836} 
       <ffffffffa068c485>{:obdfilter:filter_fmd_expire+53} 
       <ffffffffa058ab00>{:ptlrpc:ldlm_server_blocking_ast+0} 
       <ffffffffa058bbb0>{:ptlrpc:ldlm_server_completion_ast+0} 
       <ffffffffa0641bb2>{:ost:ost_handle+20738} <ffffffff801f787e>{number+142} 
       <ffffffff801f7f09>{vsnprintf+1321} <ffffffffa049f022>{:libcfs:libcfs_debug_msg+1554} 
       <ffffffffa05bc93a>{:ptlrpc:ptlrpc_server_handle_request+3066} 
       <ffffffff80116d35>{do_gettimeofday+101} <ffffffffa04a114e>{:libcfs:lcw_update_time+30} 
       <ffffffff801454bd>{__mod_timer+317} <ffffffffa05bdd17>{:ptlrpc:ptlrpc_main+2375} 
       <ffffffff80137be0>{default_wake_function+0} <ffffffffa05bd3c0>{:ptlrpc:ptlrpc_retry_rqbds+0} 
       <ffffffffa05bd3c0>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff801114ab>{child_rip+8} 
       <ffffffffa05bd3d0>{:ptlrpc:ptlrpc_main+0} <ffffffff801114a3>{child_rip+0} 
       
LustreError: dumping log to /tmp/lustre-log-tiger-oss-0-0.local.1225493885.20559
LustreError: 20526:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1225493865, 20s ago) req@000001001986c000 x1346/t0 o106->@NET_0xb00000a0000ed_UUID:15 lens 176/112 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: A client on nid 10.0.0.237@vib was evicted from service ost1.
LustreError: 20526:0:(ldlm_lockd.c:423:ldlm_failed_ast()) ### glimpse AST failed (-110): evicting client 82ed7_tiger_lov_eea68aa0e0@NET_0xb00000a0000ed_UUID NID 10.0.0.237@vib (10.0.0.237@vib) ns: filter-ost1_UUID lock: 000001010f17dac0/0x7bef85be5c792519 lrc: 2/0,0 mode: PW/PW res: 15168088/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0 remote: 0x3dd1a37046787550 expref: 182 pid: 20464
LustreError: 20702:0:(ost_handler.c:955:ost_brw_write()) @@@ Eviction on bulk GET req@0000010061543400 x4033002/t0 o4->82ed7_tiger_lov_eea68aa0e0@NET_0xb00000a0000ed_UUID:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 20866:0:(ost_handler.c:747:ost_brw_read()) @@@ timeout on bulk PUT req@0000010055099050 x4032963/t0 o3->82ed7_tiger_lov_eea68aa0e0@NET_0xb00000a0000ed_UUID:-1 lens 328/280 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 20866:0:(ost_handler.c:747:ost_brw_read()) Skipped 4 previous similar messages
Lustre: 20866:0:(ost_handler.c:813:ost_brw_read()) ost13: ignoring bulk IO comm error with 82ed7_tiger_lov_eea68aa0e0@NET_0xb00000a0000ed_UUID id 12345-10.0.0.237@vib - client will retry
Lustre: 20866:0:(ost_handler.c:813:ost_brw_read()) Skipped 4 previous similar messages
Lustre: 20702:0:(ost_handler.c:1050:ost_brw_write()) ost1: ignoring bulk IO comm error with 82ed7_tiger_lov_eea68aa0e0@NET_0xb00000a0000ed_UUID id 12345-10.0.0.237@vib - client will retry
Lustre: 20106:0:(lib-move.c:1728:lnet_parse_reply()) 10.0.0.230@vib: Dropping REPLY from 12345-10.0.0.237@vib for invalid MD 0x45a932db4532a.0xb4ef9
Lustre: 20106:0:(lib-move.c:1624:lnet_parse_put()) Dropping PUT from 12345-10.0.0.237@vib portal 16 match 1346 offset 0 length 112: 2
LustreError: 20592:0:(ldlm_lib.c:1272:target_send_reply_msg()) @@@ processing error (-107) req@000001010209d800 x4033039/t0 o101-><?>@<?>:-1 lens 176/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 20559:0:(service.c:648:ptlrpc_server_handle_request()) request 4510790 opc 101 from 12345-10.0.0.232@vib processed in 109s trans 0 rc 0/0
-- cut --
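
In case it helps with triage, below is a rough sketch of a script (hypothetical and untested; the regexes are modeled only on the message formats above, and the log path is just an example) that tallies evictions, watchdog trips, and slow requests per client NID from a saved copy of these logs:

-- cut --
#!/usr/bin/env python
# Rough triage sketch: tally Lustre eviction/watchdog/slow-request events
# per client NID from a saved dmesg/syslog capture. The patterns below are
# modeled on the 1.4.x messages pasted above; adjust to taste.
import re
import sys
from collections import defaultdict

# Two message forms can fire for the same eviction, so treat counts as rough.
EVICT_AST = re.compile(r'evicting client \S+ (?:NID|nid) (\S+@\w+)')
EVICT_SVC = re.compile(r'A client on nid (\S+@\w+) was evicted from service')
SLOW_REQ = re.compile(r'request \d+ opc \d+ from 12345-(\S+@\w+) processed in (\d+)s')
WATCHDOG = 'Watchdog triggered for pid'

def triage(path):
    evictions = defaultdict(int)
    slow = defaultdict(list)
    watchdogs = 0
    with open(path) as f:
        for line in f:
            if WATCHDOG in line:
                watchdogs += 1
                continue
            m = EVICT_AST.search(line) or EVICT_SVC.search(line)
            if m:
                evictions[m.group(1)] += 1
                continue
            m = SLOW_REQ.search(line)
            if m:
                slow[m.group(1)].append(int(m.group(2)))
    print('watchdog trips: %d' % watchdogs)
    for nid in sorted(evictions):
        print('evictions of %s: %d' % (nid, evictions[nid]))
    for nid in sorted(slow):
        print('slow requests from %s: %d (worst %ds)'
              % (nid, len(slow[nid]), max(slow[nid])))

if __name__ == '__main__':
    triage(sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages')
-- cut --

It's crude, but it should at least show whether the trouble clusters on a few clients (as it seems to here, with 10.0.0.237@vib, 10.0.0.249@vib, and 10.0.0.250@vib) or is spread across the whole fleet.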

I'm working on getting the clients upgraded to 1.6, but it's slow going.

Any insight would be very helpful ... I can probably get on top of the fix if I have an inkling of where to start looking. At this point I don't; I just get reports that it's "slow".

thanks,
Klaus

PS Sorry for the massive paste to everyone else on the list.

-----Original Message-----
From: Andreas.Dilger@sun.com on behalf of Andreas Dilger
Sent: Mon 11/3/2008 9:47 AM
To: Steden Klaus
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Odd performance issue with 1.4.x OSS ...
 
On Oct 31, 2008  13:02 -0700, Steden Klaus wrote:
> Our Lustre started exhibiting some curious performance issues today
> ... basically, it slowed down dramatically and reliable I/O performance
> became impossible. I looked through the output of dmesg and saw a number
> of kernel 'oops' messages, but not being a Lustre kernel expert, I'm
> not exactly sure what they indicate. I stopped the OSTs on the node in
> question and ran e2fsck on the OST drives, but they've come up clean so
> I don't think it's a hardware problem. I don't have physical access to
> the machine right now so it may in fact be something on the back end,
> but I'm working on verifying that with a technician on site. In the
> meantime ... can anyone help decipher this for me? There are a couple
> of messages like it:

These kinds of messages are of relatively little use unless they include
some of the preceding lines.  Are you sure this is an oops, and not a
watchdog timeout that is dumping the stack?
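
If it would help to collect that, here is a quick sketch (untested, with a placeholder path) that prints the lines leading up to each watchdog dump in a saved capture:

-- cut --
# Sketch: print the N lines preceding each watchdog stack dump in a saved
# log capture, since the messages just before the dump are what matter.
# The trigger string matches the 1.4.x watchdog message; path is an example.
import sys
from collections import deque

TRIGGER = 'Watchdog triggered for pid'

def show_context(path, n=20):
    recent = deque(maxlen=n)   # rolling window of the last n lines seen
    with open(path) as f:
        for line in f:
            if TRIGGER in line:
                print('---- %d lines before: %s' % (len(recent), line.strip()))
                for prev in recent:
                    sys.stdout.write(prev)
            recent.append(line)

if __name__ == '__main__':
    show_context(sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages')
-- cut --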

> -- cut --
> ll_ost_215    S 00000100d2141808     0  8584      1          8585  8583 (L-TLB)
> 00000101184233e8 0000000000000046 000000000000000f ffffffffa059c3b8 
>        00000000005c2616 0000000100000000 0000000000000000 00000100d21418b0 
>        0000000000000013 0000000000000000 
> Call Trace:<ffffffffa059c3b8>{:ptlrpc:ptl_send_buf+824} <ffffffff801454bd>{__mod_timer+317} 
>        <ffffffff8033860d>{schedule_timeout+381} <ffffffff801460a0>{process_timeout+0} 
>        <ffffffffa0596e84>{:ptlrpc:ptlrpc_queue_wait+6932} 

This looks like a network problem, but it is hard to say without more
information.  If you are a supported customer, you will get better service
by filing a bugzilla bug.  This list only gets "as available" replies, and
that is doubly true for old 1.4 Lustre installations.
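
As a first datapoint, I would check LNET reachability of the clients that are being evicted. A minimal sketch, assuming your lctl build supports the ping subcommand (run as root on the OSS; the NID list is just the ones from your paste):

-- cut --
# Sketch: check LNET reachability of the client NIDs implicated above.
# Assumes "lctl ping <nid>" exists on this Lustre/LNET release and that
# this runs as root on the OSS. The NID list is illustrative.
import subprocess

NIDS = ['10.0.0.237@vib', '10.0.0.249@vib', '10.0.0.250@vib']

for nid in NIDS:
    rc = subprocess.call(['lctl', 'ping', nid])
    print('%s: %s' % (nid, 'ok' if rc == 0 else 'unreachable (rc=%d)' % rc))
-- cut --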

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.




