[Lustre-discuss] Lustre 1.6.4.1 - client lockup

Niklas Edmundsson Niklas.Edmundsson at hpc2n.umu.se
Fri Jan 25 07:10:47 PST 2008


Hi again!

We're able to consistently kill the Lustre client with bonnie++ in
combination with striping. This is Lustre 1.6.4.1 on a Debian 2.6.18 amd64
kernel with Lustre patches on both servers and clients (i.e. not a
patchless client, even though we're pretty sure it's the same bug
that bites us with the Ubuntu 2.6.15 kernel and patchless client).

All machines are dual opterons connected with GigE.

We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSes with
2 OST targets (~1.2TB) each.

We're able to consistently cause a Lustre client lock-up by doing the
following:

cd /into-lustre-filesystem
mkdir striped
lfs setstripe striped 0 -1 -1
cd striped
mkdir host1 host2 host3 host4 host5
for i in host1 host2 host3 host4 host5; do
   rsh $i "cd $PWD; bonnie++ -d $i -n 60:0:0:30 > res.$i 2>&1" &
done

After 10-15 minutes it locks up with the following stack trace:
========
Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!
Jan 25 11:16:23 
Jan 25 11:16:23 Call Trace:
Jan 25 11:16:23  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 11:16:23  [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 11:16:23  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 11:16:23  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 11:16:23  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 11:16:23  <EOI> [<ffffffff804187e3>] __lock_text_start+0x3/0x10
Jan 25 11:16:23  [<ffffffff8851d97c>] :ptlrpc:ptlrpc_check_set+0x6bc/0xb70
Jan 25 11:16:23  [<ffffffff88518f0a>] :ptlrpc:__ptlrpc_free_req+0x67a/0x6e0
Jan 25 11:16:23  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 11:16:23  [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 11:16:23  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 11:16:23  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 11:16:23  [<ffffffff8028da91>] filp_close+0x71/0x90
Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23  [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23  [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 11:16:23  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 11:16:23  [<ffffffff8020ac42>] child_rip+0x0/0x12
========

The same thing happens with 4-way striping:

mkdir striped-4ways
lfs setstripe striped-4ways 0 -1 4
(then repeat the above test in striped-4ways)

After 10-15 minutes it locks up, this time with a bunch of
LustreErrors before the stack trace:
========
Jan 25 13:30:40 LustreError: 5748:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1201264136, 103s ago)  req@ffff8100e3317e00 x1219785/t0 o6->hpfs-OST0004_UUID@130.239.78.239@tcp:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
Jan 25 13:30:40 Lustre: hpfs-OST0004-osc-ffff8100ecad4000: Connection to service hpfs-OST0004 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 25 13:30:54 BUG: soft lockup detected on CPU#1!
Jan 25 13:30:54 
Jan 25 13:30:54 Call Trace:
Jan 25 13:30:54  <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 13:30:54  [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 13:30:54  [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 13:30:54  [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 13:30:54  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 13:30:54  <EOI> [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54  [<ffffffff80418a69>] .text.lock.spinlock+0x0/0x97
Jan 25 13:30:54  [<ffffffff884385be>] :lnet:LNetMEAttach+0x24e/0x330
Jan 25 13:30:54  [<ffffffff88524771>] :ptlrpc:ptl_send_rpc+0x711/0xf20
Jan 25 13:30:54  [<ffffffff8851c727>] :ptlrpc:ptlrpc_unregister_reply+0x107/0x2f0
Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54  [<ffffffff88529ae7>] :ptlrpc:lustre_msg_add_flags+0x47/0x120
Jan 25 13:30:54  [<ffffffff8851d923>] :ptlrpc:ptlrpc_check_set+0x663/0xb70
Jan 25 13:30:54  [<ffffffff885447ea>] :ptlrpc:ptlrpc_fail_import+0x9a/0x220
Jan 25 13:30:54  [<ffffffff8852869f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
Jan 25 13:30:54  [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54  [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 13:30:54  [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 13:30:54  [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 13:30:54  [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54  [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54  [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 13:30:54  [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 13:30:54  [<ffffffff8020ac42>] child_rip+0x0/0x12
========
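
For anyone wanting to line up the "sent at" epoch value in the LustreError
with the syslog timestamps, here is a small sketch. It assumes GNU date
(the -d @EPOCH form is not portable), and the only inputs are the two
numbers copied from the message above. The times come out in UTC; our
client logs in local time, so expect an offset.

```shell
#!/bin/sh
# Values copied from "timeout (sent at 1201264136, 103s ago)".
sent=1201264136
ago=103

# GNU date syntax (-d @EPOCH); other date implementations differ.
# Print when the request was sent and when it was declared expired.
date -u -d "@$sent" '+sent:    %Y-%m-%d %H:%M:%S UTC'
date -u -d "@$((sent + ago))" '+expired: %Y-%m-%d %H:%M:%S UTC'
```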


Note that the two stack traces are somewhat different.
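
To see exactly which frames differ, something like the following rough
sketch may help. It assumes each trace has been saved to its own file
(the file names in the usage line are made up) and that the frames look
like the syslog lines above, i.e. "[<address>] symbol+0xoff/0xsize".

```shell
#!/bin/sh
# Print the symbol of each stack frame found on stdin, one per line,
# stripping the timestamp, the address and the +offset/size suffix.
frames() {
    sed -n 's/.*\[<[0-9a-f]*>\] \([^+]*\)+0x.*/\1/p'
}

# List symbols that appear in the second trace file but not in the first.
# $1 and $2 are files holding the saved syslog excerpts.
new_frames() {
    a=$(mktemp) && b=$(mktemp) || return 1
    frames < "$1" | sort -u > "$a"
    frames < "$2" | sort -u > "$b"
    comm -13 "$a" "$b"    # keep only lines unique to the second file
    rm -f "$a" "$b"
}
```

Usage would be along the lines of "new_frames trace1.log trace2.log",
where trace1.log and trace2.log are hypothetical names for the two
excerpts above.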


If run in a non-striped directory, it doesn't lock up.



/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se     |    nikke at hpc2n.umu.se
---------------------------------------------------------------------------
  "Jake, honey, when did we become Republicans?" - Celeste Kane
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


