[Lustre-discuss] Lustre 1.6.4.1 - client lockup

Niklas Edmundsson Niklas.Edmundsson at hpc2n.umu.se
Sun Jan 27 22:34:30 PST 2008


On Fri, 25 Jan 2008, Harald van Pee wrote:

> Hi,
>
> that's interesting for me.
> Can you just try what happens if you delete a large directory
> (lots of files, a couple of GB total space) from this client?

Works, as long as we only have one client doing rm on the directory. 
If we do rm concurrently from multiple clients the MDS bugs out.
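For reference, the access pattern that triggers it is easy to script. Below is a local stand-in sketch of the same pattern (several workers deleting the contents of one shared directory at the same time); on our setup it's doing this from several Lustre clients at once that trips the MDS, while locally the same races are harmless. The paths, file counts, and worker count are made up for illustration.

```shell
# Local stand-in for the multi-client rm test: several background
# workers remove the same directory's files concurrently.
# On the real cluster each worker would be an `rsh <client> rm ...`.
DIR=$(mktemp -d)
for i in $(seq 1 50); do
    echo data > "$DIR/file$i"
done

# three concurrent "clients" removing the shared directory contents;
# rm -f silently ignores files another worker already removed
for w in 1 2 3; do
    ( rm -f "$DIR"/file* ) &
done
wait

ls "$DIR" | wc -l    # 0: everything removed despite the races
```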

> I have a test cluster running 1.6.4.1 with a vanilla 2.6.18.8 kernel.
> The clients are patchless, and server and clients have been rock
> stable for weeks, but I have only one dual Opteron machine (the
> others are mostly Athlons and a couple of Pentiums)
> connected with GigE,
> which is a rock solid machine if I don't mount Lustre.
> If I mount Lustre on this machine it crashes all the time.
> The last crash happened directly after I tried to delete a large
> directory from this client.
> Up to now I thought I must have done something wrong with the
> installation of this client, because it behaves completely differently
> from the others, but maybe I am wrong?

It might be the same bug, or not... IMHO it's an indication of a 
buffer overrun that happens more often on a 64-bit box due to the 
increased storage needed for pointers and so on... But with only one 
machine crashing it's hard to rule out other issues.

>
> Harald
>
>
> On Friday 25 January 2008 04:10 pm, Niklas Edmundsson wrote:
>> Hi again!
>>
>> We're able to consistently kill the Lustre client with bonnie++ in
>> combination with striping. This is Lustre 1.6.4.1, Debian 2.6.18 amd64
>> kernel with Lustre patches on both server and clients (i.e. not a
>> patchless client, even though we're pretty sure that it's the same bug
>> that bites us using an Ubuntu 2.6.15 kernel and a patchless client).
>>
>> All machines are dual opterons connected with GigE.
>>
>> We have 5 servers, 1 MDS with 1 MGS and 1 MDT target and 4 OSS's with
>> 2 OST targets (~1.2TB) each.
>>
>> We're able to consistently cause a lustre client lock-up doing the
>> following:
>>
>> cd /into-lustre-filesystem
>> mkdir striped
>> # 1.6 setstripe syntax: size 0 (default), offset -1 (any), count -1 (all OSTs)
>> lfs setstripe striped 0 -1 -1
>> cd striped
>> mkdir host1 host2 host3 host4 host5
>> for i in host1 host2 host3 host4 host5; do
>>    rsh $i "cd $PWD; bonnie++ -d $i -n 60:0:0:30 > res.$i 2>&1" &
>> done
>> After 10-15 minutes it locks up with the following stacktrace:
>> ========
>> Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!
>> Jan 25 11:16:23
>> Jan 25 11:16:23 Call Trace:
>> Jan 25 11:16:23  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
>> Jan 25 11:16:23  [<ffffffff8023f207>] update_process_times+0x57/0x90
>> Jan 25 11:16:23  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
>> Jan 25 11:16:23  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
>> Jan 25 11:16:23  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
>> Jan 25 11:16:23  <EOI> [<ffffffff804187e3>] __lock_text_start+0x3/0x10
>> Jan 25 11:16:23  [<ffffffff8851d97c>] :ptlrpc:ptlrpc_check_set+0x6bc/0xb70
>> Jan 25 11:16:23  [<ffffffff88518f0a>] :ptlrpc:__ptlrpc_free_req+0x67a/0x6e0
>> Jan 25 11:16:23  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
>> Jan 25 11:16:23  [<ffffffff8023ef50>] process_timeout+0x0/0x10
>> Jan 25 11:16:23  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
>> Jan 25 11:16:23  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
>> Jan 25 11:16:23  [<ffffffff8028da91>] filp_close+0x71/0x90
>> Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
>> Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
>> Jan 25 11:16:23  [<ffffffff8020ac4c>] child_rip+0xa/0x12
>> Jan 25 11:16:23  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
>> Jan 25 11:16:23  [<ffffffff8020ac42>] child_rip+0x0/0x12
>> ========
>>
>> mkdir striped-4ways
>> # as above, but stripe over 4 OSTs instead of all
>> lfs setstripe striped-4ways 0 -1 4
>> repeat above test
>> After 10-15 minutes it locks up, this time with a bunch of
>> LustreErrors before the stack trace:
>> ========
>> Jan 25 13:30:40 LustreError: 5748:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1201264136, 103s ago)  req@ffff8100e3317e00 x1219785/t0 o6->hpfs-OST0004_UUID@130.239.78.239@tcp:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
>> Jan 25 13:30:40 Lustre: hpfs-OST0004-osc-ffff8100ecad4000: Connection to service hpfs-OST0004 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
>> Jan 25 13:30:54 BUG: soft lockup detected on CPU#1!
>> Jan 25 13:30:54
>> Jan 25 13:30:54 Call Trace:
>> Jan 25 13:30:54  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
>> Jan 25 13:30:54  [<ffffffff8023f207>] update_process_times+0x57/0x90
>> Jan 25 13:30:54  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
>> Jan 25 13:30:54  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
>> Jan 25 13:30:54  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
>> Jan 25 13:30:54  <EOI> [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
>> Jan 25 13:30:54  [<ffffffff80418a69>] .text.lock.spinlock+0x0/0x97
>> Jan 25 13:30:54  [<ffffffff884385be>] :lnet:LNetMEAttach+0x24e/0x330
>> Jan 25 13:30:54  [<ffffffff88524771>] :ptlrpc:ptl_send_rpc+0x711/0xf20
>> Jan 25 13:30:54  [<ffffffff8851c727>] :ptlrpc:ptlrpc_unregister_reply+0x107/0x2f0
>> Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
>> Jan 25 13:30:54  [<ffffffff88529ae7>] :ptlrpc:lustre_msg_add_flags+0x47/0x120
>> Jan 25 13:30:54  [<ffffffff8851d923>] :ptlrpc:ptlrpc_check_set+0x663/0xb70
>> Jan 25 13:30:54  [<ffffffff885447ea>] :ptlrpc:ptlrpc_fail_import+0x9a/0x220
>> Jan 25 13:30:54  [<ffffffff8852869f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
>> Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
>> Jan 25 13:30:54  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
>> Jan 25 13:30:54  [<ffffffff8023ef50>] process_timeout+0x0/0x10
>> Jan 25 13:30:54  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
>> Jan 25 13:30:54  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
>> Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
>> Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
>> Jan 25 13:30:54  [<ffffffff8020ac4c>] child_rip+0xa/0x12
>> Jan 25 13:30:54  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
>> Jan 25 13:30:54  [<ffffffff8020ac42>] child_rip+0x0/0x12
>> ========
>>
>>
>> Note that the two stack traces are somewhat different.
>>
>>
>> If run in a non-striped directory it doesn't lock up.
>>
>>
>>
>> /Nikke
>
>


/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se     |    nikke at hpc2n.umu.se
---------------------------------------------------------------------------
  Riker: If it becomes necessary to fight, can someone find
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


