[Lustre-discuss] LBUG mds_reint.c, questions about recovery time

Thomas Roth t.roth at gsi.de
Mon Oct 13 11:21:01 PDT 2008


Hi all,

I just ran into an LBUG on an MDS still running Lustre version 1.6.3 with 
kernel 2.6.18 on Debian Etch.
The kern.log excerpt is below. You will probably tell me this is a known 
bug that is already fixed or about to be fixed (I'm not sure how to 
search for such a thing in Bugzilla).
But my main question concerns the subsequent recovery. It seems to have 
worked fine, but it took two hours. What influences the recovery time?
During this period I was watching 
/proc/fs/lustre/mds/lustre-MDT0000/recovery_status. It continually 
showed a remaining time of roughly 2100 s, fluctuating between 2400 and 
1900, until the last 10 minutes or so, when the time really went down. 
So is this just Lustre's rough guess of what the remaining recovery time 
might be?
recovery_status also showed 346 connected clients, of which 146 had 
finished long ago, while the others obviously had not. I wanted to be 
clever and manually unmounted Lustre on a number of our batch nodes 
that were not using Lustre at that time. This neither changed the 
reported number of connected clients nor had any perceptible effect on 
the recovery.
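For reference, something like the following sketch is what I mean by 
"watching" the file. It just polls recovery_status once a minute and 
prints a few fields; the key names ("status", "time_remaining", 
"connected_clients", "completed_clients") are only a guess and may 
differ between Lustre versions:

  #!/usr/bin/env python
  # Sketch: poll the MDT recovery status file once a minute and print
  # a few fields.  The key names in KEYS are assumptions; check the
  # actual contents of recovery_status on your version of Lustre.
  import time

  STATUS_FILE = "/proc/fs/lustre/mds/lustre-MDT0000/recovery_status"
  KEYS = ("status", "time_remaining", "connected_clients", "completed_clients")

  while True:
      fields = {}
      for line in open(STATUS_FILE):
          if ":" in line:
              key, _, value = line.partition(":")
              fields[key.strip()] = value.strip()
      print("  ".join("%s=%s" % (k, fields.get(k, "?")) for k in KEYS))
      if fields.get("status") == "COMPLETE":   # recovery finished
          break
      time.sleep(60)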

-----------------------------------------------------------------------------------------------------------------------------------
Oct 13 17:10:58  kernel: LustreError: 
9132:0:(mds_reint.c:1512:mds_orphan_add_link()) ASSERTION(inode->i_nlink 
== 1) failed:dir nlink == 0
Oct 13 17:10:58  kernel: LustreError: 
9132:0:(mds_reint.c:1512:mds_orphan_add_link()) LBUG
Oct 13 17:10:58  kernel: Lustre: 
9132:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for 
process 9132
Oct 13 17:10:58  kernel: ll_mdt_77     R running     0  9132      1 
      9133  9131 (L-TLB)
Oct 13 17:10:58  kernel: e14eb98c 00000046 55c3b8b1 000016ab 0000006e 
0000000a c084b550 e1abeaa0
Oct 13 17:10:58  kernel: 8d1b6f09 001aedee 0000c81c 00000001 c0116bb3 
dffcc000 ea78f060 c02cbab0
Oct 13 17:10:58  kernel: dffcc000 00000082 c0117c15 0013fa7b 00000000 
00000001 3638acd3 00003931
Oct 13 17:10:58  kernel: Call Trace:
Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58  kernel: [<c011de22>] printk+0x14/0x18
Oct 13 17:10:58  kernel: [<c0136851>] __print_symbol+0x9f/0xa8
Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58  kernel: [<c0117c15>] try_to_wake_up+0x355/0x35f
Oct 13 17:10:58  kernel: [<c01166f5>] __wake_up_common+0x2f/0x53
Oct 13 17:10:58  kernel: [<c0116b46>] __wake_up+0x2a/0x3d
Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58  kernel: [<c012c6d8>] __kernel_text_address+0x18/0x23
Oct 13 17:10:58  kernel: [<c0103b62>] show_trace_log_lvl+0x47/0x6a
Oct 13 17:10:58  kernel: [<c0103c13>] show_stack_log_lvl+0x8e/0x96
Oct 13 17:10:58  kernel: [<c0104107>] show_stack+0x20/0x25
Oct 13 17:10:58  kernel: [<fa1bef79>] lbug_with_loc+0x69/0xc0 [libcfs]
Oct 13 17:10:58  kernel: [<fa689448>] mds_orphan_add_link+0xcb8/0xd20 [mds]
Oct 13 17:10:58  kernel: [<fa69c87a>] mds_reint_unlink+0x292a/0x3fd0 [mds]
Oct 13 17:10:58  kernel: [<fa3ac990>] lustre_swab_ldlm_request+0x0/0x20 
[ptlrpc]
Oct 13 17:10:58  kernel: [<fa688495>] mds_reint_rec+0xf5/0x3f0 [mds]
Oct 13 17:10:58  kernel: [<fa39f788>] ptl_send_buf+0x1b8/0xb00 [ptlrpc]
Oct 13 17:10:58  kernel: [<fa66bfeb>] mds_reint+0xcb/0x8a0 [mds]
Oct 13 17:10:58  kernel: [<fa67f998>] mds_handle+0x3048/0xb9df [mds]
Oct 13 17:10:58  kernel: [<fa4ac402>] LNetMEAttach+0x142/0x4a0 [lnet]
Oct 13 17:10:58  kernel: [<fa2dcd91>] class_handle_free_cb+0x21/0x190 
[obdclass]
Oct 13 17:10:58  kernel: [<c0124d83>] do_gettimeofday+0x31/0xce
Oct 13 17:10:58  kernel: [<fa2dc06b>] class_handle2object+0xbb/0x2a0 
[obdclass]
Oct 13 17:10:58  kernel: [<fa3aca00>] lustre_swab_ptlrpc_body+0x0/0xc0 
[ptlrpc]
Oct 13 17:10:58  kernel: [<fa3a9b5a>] lustre_swab_buf+0xfa/0x180 [ptlrpc]
Oct 13 17:10:58  kernel: [<c0125aac>] lock_timer_base+0x15/0x2f
Oct 13 17:10:59  kernel: [<c0125bbd>] __mod_timer+0x99/0xa3
Oct 13 17:10:59  kernel: [<fa3a6efe>] lustre_msg_get_conn_cnt+0xce/0x220 
[ptlrpc]
Oct 13 17:10:59  kernel: [<fa3b8e56>] ptlrpc_main+0x2016/0x2f40 [ptlrpc]
Oct 13 17:10:59  kernel: [<c01b6dc0>] __next_cpu+0x12/0x21
Oct 13 17:10:59  kernel: [<c012053d>] do_exit+0x711/0x71b
Oct 13 17:10:59  kernel: [<c0117c1f>] default_wake_function+0x0/0xc
Oct 13 17:10:59  kernel: [<fa3b6e40>] ptlrpc_main+0x0/0x2f40 [ptlrpc]
Oct 13 17:10:59  kernel: [<c0101005>] kernel_thread_helper+0x5/0xb

Oct 13 17:12:38  kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog 
triggered for pid 9132: it was inactive for 100s


Cheers,
Thomas
