[Lustre-discuss] LBUG mds_reint.c, questions about recovery time
Thomas Roth
t.roth at gsi.de
Mon Oct 13 11:21:01 PDT 2008
Hi all,
I just ran into an LBUG on an MDS still running Lustre version 1.6.3 with
kernel 2.6.18 on Debian Etch; the kern.log excerpt is below. You will
probably tell me this is a known bug, already fixed or about to be fixed
(I'm not sure how to search for such a thing in Bugzilla).
But my main question concerns the subsequent recovery. It seems to have
worked fine; however, it took 2 hours, and I would like to know what
influences the recovery time.
During this period I was watching
/proc/fs/lustre/mds/lustre-MDT0000/recovery_status. It continually showed
a remaining time of around 2100 s, fluctuating between 2400 and 1900,
until the last 10 minutes or so, when the time really went down. So is
this just Lustre's rough guess at what the remaining recovery time might
be?
recovery_status also showed 346 connected clients, of which 146 had been
marked as completed for a long time, while the others obviously had not.
Trying to be clever, I manually unmounted Lustre on a number of our batch
nodes that were not using Lustre at the time. This neither changed the
reported number of connected clients nor had any perceptible effect on
recovery.
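In case it helps to see what I was looking at: something like the small
Python sketch below is enough to watch the status file. The field names
(time_remaining, connected_clients, completed_clients) are guesses based
on what our 1.6 MDS prints, not anything official, so adjust them to
whatever your recovery_status actually contains.

# Minimal sketch: poll the MDT recovery_status file and print a summary.
# The path comes from our MDS; the field names below are assumptions.
import time

STATUS_FILE = "/proc/fs/lustre/mds/lustre-MDT0000/recovery_status"

def read_status(path=STATUS_FILE):
    # Parse "key: value" lines into a dict of strings.
    fields = {}
    for line in open(path):
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

while True:
    s = read_status()
    print("%s  status=%s  remaining=%s  clients done=%s of %s" % (
        time.strftime("%H:%M:%S"),
        s.get("status", "?"),
        s.get("time_remaining", "?"),
        s.get("completed_clients", "?"),
        s.get("connected_clients", "?")))
    time.sleep(60)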
-----------------------------------------------------------------------------------------------------------------------------------
Oct 13 17:10:58 kernel: LustreError:
9132:0:(mds_reint.c:1512:mds_orphan_add_link()) ASSERTION(inode->i_nlink
== 1) failed:dir nlink == 0
Oct 13 17:10:58 kernel: LustreError:
9132:0:(mds_reint.c:1512:mds_orphan_add_link()) LBUG
Oct 13 17:10:58 kernel: Lustre:
9132:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
process 9132
Oct 13 17:10:58 kernel: ll_mdt_77 R running 0 9132 1
9133 9131 (L-TLB)
Oct 13 17:10:58 kernel: e14eb98c 00000046 55c3b8b1 000016ab 0000006e
0000000a c084b550 e1abeaa0
Oct 13 17:10:58 kernel: 8d1b6f09 001aedee 0000c81c 00000001 c0116bb3
dffcc000 ea78f060 c02cbab0
Oct 13 17:10:58 kernel: dffcc000 00000082 c0117c15 0013fa7b 00000000
00000001 3638acd3 00003931
Oct 13 17:10:58 kernel: Call Trace:
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c011de22>] printk+0x14/0x18
Oct 13 17:10:58 kernel: [<c0136851>] __print_symbol+0x9f/0xa8
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0117c15>] try_to_wake_up+0x355/0x35f
Oct 13 17:10:58 kernel: [<c01166f5>] __wake_up_common+0x2f/0x53
Oct 13 17:10:58 kernel: [<c0116b46>] __wake_up+0x2a/0x3d
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c012c6d8>] __kernel_text_address+0x18/0x23
Oct 13 17:10:58 kernel: [<c0103b62>] show_trace_log_lvl+0x47/0x6a
Oct 13 17:10:58 kernel: [<c0103c13>] show_stack_log_lvl+0x8e/0x96
Oct 13 17:10:58 kernel: [<c0104107>] show_stack+0x20/0x25
Oct 13 17:10:58 kernel: [<fa1bef79>] lbug_with_loc+0x69/0xc0 [libcfs]
Oct 13 17:10:58 kernel: [<fa689448>] mds_orphan_add_link+0xcb8/0xd20 [mds]
Oct 13 17:10:58 kernel: [<fa69c87a>] mds_reint_unlink+0x292a/0x3fd0 [mds]
Oct 13 17:10:58 kernel: [<fa3ac990>] lustre_swab_ldlm_request+0x0/0x20
[ptlrpc]
Oct 13 17:10:58 kernel: [<fa688495>] mds_reint_rec+0xf5/0x3f0 [mds]
Oct 13 17:10:58 kernel: [<fa39f788>] ptl_send_buf+0x1b8/0xb00 [ptlrpc]
Oct 13 17:10:58 kernel: [<fa66bfeb>] mds_reint+0xcb/0x8a0 [mds]
Oct 13 17:10:58 kernel: [<fa67f998>] mds_handle+0x3048/0xb9df [mds]
Oct 13 17:10:58 kernel: [<fa4ac402>] LNetMEAttach+0x142/0x4a0 [lnet]
Oct 13 17:10:58 kernel: [<fa2dcd91>] class_handle_free_cb+0x21/0x190
[obdclass]
Oct 13 17:10:58 kernel: [<c0124d83>] do_gettimeofday+0x31/0xce
Oct 13 17:10:58 kernel: [<fa2dc06b>] class_handle2object+0xbb/0x2a0
[obdclass]
Oct 13 17:10:58 kernel: [<fa3aca00>] lustre_swab_ptlrpc_body+0x0/0xc0
[ptlrpc]
Oct 13 17:10:58 kernel: [<fa3a9b5a>] lustre_swab_buf+0xfa/0x180 [ptlrpc]
Oct 13 17:10:58 kernel: [<c0125aac>] lock_timer_base+0x15/0x2f
Oct 13 17:10:59 kernel: [<c0125bbd>] __mod_timer+0x99/0xa3
Oct 13 17:10:59 kernel: [<fa3a6efe>] lustre_msg_get_conn_cnt+0xce/0x220
[ptlrpc]
Oct 13 17:10:59 kernel: [<fa3b8e56>] ptlrpc_main+0x2016/0x2f40 [ptlrpc]
Oct 13 17:10:59 kernel: [<c01b6dc0>] __next_cpu+0x12/0x21
Oct 13 17:10:59 kernel: [<c012053d>] do_exit+0x711/0x71b
Oct 13 17:10:59 kernel: [<c0117c1f>] default_wake_function+0x0/0xc
Oct 13 17:10:59 kernel: [<fa3b6e40>] ptlrpc_main+0x0/0x2f40 [ptlrpc]
Oct 13 17:10:59 kernel: [<c0101005>] kernel_thread_helper+0x5/0xb
Oct 13 17:12:38 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog
triggered for pid 9132: it was inactive for 100s
Cheers,
Thomas