[Lustre-discuss] Lustre 1.6.4.1 - client lockup

Harald van Pee pee at hiskp.uni-bonn.de
Fri Jan 25 13:34:16 PST 2008


Hi, 

that's interesting for me.
Can you just try what happens if you delete a large directory
(lots of files, a couple of GB of total space) from this client?
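
Something along these lines is roughly what I do here (the path, file count
and file sizes are only examples, not exactly what I had):

  cd /mnt/lustre                    # wherever the client mounts the filesystem
  mkdir testdir && cd testdir
  for i in $(seq 1 1000); do        # ~10 GB spread over 1000 files
      dd if=/dev/zero of=file.$i bs=1M count=10 2>/dev/null
  done
  cd .. && rm -rf testdir           # the rm is where my last crash happened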

I have a test cluster running 1.6.4.1 on a vanilla 2.6.18.8 kernel.
The clients are patchless; server and clients have been rock stable for
weeks. But I have only one dual Opteron machine (the others are mostly Athlons
and a couple of Pentiums),
connected with GigE,
which is a rock solid machine as long as I don't mount Lustre.
If I mount Lustre on this machine it crashes all the time.
The last crash happened directly after I tried to delete a large directory
from this client.
Up to now I thought I must have done something wrong with the installation of
this client, because it behaves completely differently from the others, but
maybe I am wrong?

Harald


On Friday 25 January 2008 04:10 pm, Niklas Edmundsson wrote:
> Hi again!
>
> We're able to consistently kill the Lustre client with bonnie++ in
> combination with striping. This is Lustre 1.6.4.1, Debian 2.6.18 amd64
> kernel with Lustre patches on both server and clients (i.e. not the
> patchless client, even though we're pretty sure that it's the same bug
> that bites us using the Ubuntu 2.6.15 kernel and patchless client).
>
> All machines are dual opterons connected with GigE.
>
> We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSes with
> 2 OST targets (~1.2TB) each.
>
> We're able to consistently cause a Lustre client lock-up by doing the
> following:
>
> cd /into-lustre-filesystem
> mkdir striped
> lfs setstripe striped 0 -1 -1
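> # (if I read the 1.6-style positional syntax right, the args are
> #  <stripe-size> <start-ost> <stripe-count>, so 0 -1 -1 means default
> #  stripe size, any starting OST, stripe over all OSTs)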
> cd striped
> mkdir host1 host2 host3 host4 host5
> for i in host1 host2 host3 host4 host5; do
>    rsh $i "cd $PWD; bonnie++ -d $i -n 60:0:0:30 > res.$i 2>&1" &
> done
> After 10-15 minutes it locks up with the following stacktrace:
> ========
> Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!
> Jan 25 11:16:23
> Jan 25 11:16:23 Call Trace:
> Jan 25 11:16:23  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
> Jan 25 11:16:23  [<ffffffff8023f207>] update_process_times+0x57/0x90
> Jan 25 11:16:23  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
> Jan 25 11:16:23  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
> Jan 25 11:16:23  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
> Jan 25 11:16:23  <EOI> [<ffffffff804187e3>] __lock_text_start+0x3/0x10
> Jan 25 11:16:23  [<ffffffff8851d97c>] :ptlrpc:ptlrpc_check_set+0x6bc/0xb70
> Jan 25 11:16:23  [<ffffffff88518f0a>] :ptlrpc:__ptlrpc_free_req+0x67a/0x6e0
> Jan 25 11:16:23  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
> Jan 25 11:16:23  [<ffffffff8023ef50>] process_timeout+0x0/0x10
> Jan 25 11:16:23  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
> Jan 25 11:16:23  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
> Jan 25 11:16:23  [<ffffffff8028da91>] filp_close+0x71/0x90
> Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
> Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
> Jan 25 11:16:23  [<ffffffff8020ac4c>] child_rip+0xa/0x12
> Jan 25 11:16:23  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
> Jan 25 11:16:23  [<ffffffff8020ac42>] child_rip+0x0/0x12
> ========
>
> mkdir striped-4ways
> lfs setstripe striped-4ways 0 -1 4
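> # (same syntax as above, but stripe_count=4 instead of -1/all OSTs)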
> repeat above test
> After 10-15 minutes it locks up, this time with a bunch of
> LustreErrors before the stack trace:
> ========
> Jan 25 13:30:40 LustreError: 5748:0:(client.c:975:ptlrpc_expire_one_request())
> @@@ timeout (sent at 1201264136, 103s ago)  req@ffff8100e3317e00 x1219785/t0
> o6->hpfs-OST0004_UUID@130.239.78.239@tcp:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
> Jan 25 13:30:40 Lustre: hpfs-OST0004-osc-ffff8100ecad4000: Connection to service
> hpfs-OST0004 via nid 130.239.78.239@tcp was lost; in progress operations using
> this service will wait for recovery to complete.
> Jan 25 13:30:54 BUG: soft lockup detected on CPU#1!
> Jan 25 13:30:54
> Jan 25 13:30:54 Call Trace:
> Jan 25 13:30:54  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
> Jan 25 13:30:54  [<ffffffff8023f207>] update_process_times+0x57/0x90
> Jan 25 13:30:54  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
> Jan 25 13:30:54  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
> Jan 25 13:30:54  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
> Jan 25 13:30:54  <EOI> [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
> Jan 25 13:30:54  [<ffffffff80418a69>] .text.lock.spinlock+0x0/0x97
> Jan 25 13:30:54  [<ffffffff884385be>] :lnet:LNetMEAttach+0x24e/0x330
> Jan 25 13:30:54  [<ffffffff88524771>] :ptlrpc:ptl_send_rpc+0x711/0xf20
> Jan 25 13:30:54  [<ffffffff8851c727>] :ptlrpc:ptlrpc_unregister_reply+0x107/0x2f0
> Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
> Jan 25 13:30:54  [<ffffffff88529ae7>] :ptlrpc:lustre_msg_add_flags+0x47/0x120
> Jan 25 13:30:54  [<ffffffff8851d923>] :ptlrpc:ptlrpc_check_set+0x663/0xb70
> Jan 25 13:30:54  [<ffffffff885447ea>] :ptlrpc:ptlrpc_fail_import+0x9a/0x220
> Jan 25 13:30:54  [<ffffffff8852869f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
> Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
> Jan 25 13:30:54  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
> Jan 25 13:30:54  [<ffffffff8023ef50>] process_timeout+0x0/0x10
> Jan 25 13:30:54  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
> Jan 25 13:30:54  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
> Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
> Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
> Jan 25 13:30:54  [<ffffffff8020ac4c>] child_rip+0xa/0x12
> Jan 25 13:30:54  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
> Jan 25 13:30:54  [<ffffffff8020ac42>] child_rip+0x0/0x12
> ========
>
>
> Note that the two stack traces are somewhat different.
>
>
> If run in a non-striped directory it doesn't lock up.
>
>
>
> /Nikke

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn


