[Lustre-discuss] LBUG mds_reint.c, questions about recovery time

Wojciech Turek wjt27 at cam.ac.uk
Mon Oct 13 16:00:20 PDT 2008


Lustre's recovery time is 2.5 x the timeout. You can find the timeout by running this command on the MDS:

  cat /proc/sys/lustre/timeout
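As a quick sanity check, you can compute the expected recovery window from that value. A minimal sketch (the 100 s timeout below is just an example value, not read from your system):

```shell
# Example: suppose the MDS reports a timeout of 100 seconds, i.e.
#   cat /proc/sys/lustre/timeout  ->  100
TIMEOUT=100

# Recovery window is roughly 2.5 x the timeout (awk handles the
# non-integer multiplier):
RECOVERY=$(awk -v t="$TIMEOUT" 'BEGIN { print 2.5 * t }')
echo "expected recovery window: ${RECOVERY}s"   # prints 250s for t=100
```

On a live MDS you would replace the hard-coded value with the actual contents of /proc/sys/lustre/timeout.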

Thomas Roth wrote:
> Hi all,
>
> I just ran into an LBUG on an MDS still running Lustre version 1.6.3 with 
> kernel 2.6.18, Debian Etch; kern.log excerpt below. You will probably 
> tell me that this is a known bug, already fixed or about to be fixed 
> (I'm unsure how to search for such a thing in Bugzilla).
> But my main question concerns the subsequent recovery. It seems to have 
> worked fine, but it took 2 hours, and I would like to know what 
> influences the recovery time.
> During this period I was watching 
> /proc/fs/lustre/mds/lustre-MDT0000/recovery_status. It continually 
> showed a remaining time of about 2100 s, fluctuating between 2400 and 
> 1900, until the last 10 min or so, when the time really went down. So is 
> this just Lustre's rough guess at what the remaining recovery time 
> might be?
> recovery_status also showed 346 connected clients, of which 146 had 
> finished long ago while the others obviously had not. I tried to be 
> clever and manually unmounted Lustre on a number of our batch nodes 
> that were not using Lustre at the time. This neither changed the 
> reported number of connected clients nor had any perceptible effect 
> on recovery.
>
> -----------------------------------------------------------------------------------------------------------------------------------
> Oct 13 17:10:58  kernel: LustreError: 
> 9132:0:(mds_reint.c:1512:mds_orphan_add_link()) ASSERTION(inode->i_nlink 
> == 1) failed:dir nlink == 0
> Oct 13 17:10:58  kernel: LustreError: 
> 9132:0:(mds_reint.c:1512:mds_orphan_add_link()) LBUG
> Oct 13 17:10:58  kernel: Lustre: 
> 9132:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for 
> process 9132
> Oct 13 17:10:58  kernel: ll_mdt_77     R running     0  9132      1 
>       9133  9131 (L-TLB)
> Oct 13 17:10:58  kernel: e14eb98c 00000046 55c3b8b1 000016ab 0000006e 
> 0000000a c084b550 e1abeaa0
> Oct 13 17:10:58  kernel: 8d1b6f09 001aedee 0000c81c 00000001 c0116bb3 
> dffcc000 ea78f060 c02cbab0
> Oct 13 17:10:58  kernel: dffcc000 00000082 c0117c15 0013fa7b 00000000 
> 00000001 3638acd3 00003931
> Oct 13 17:10:58  kernel: Call Trace:
> Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
> Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
> Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
> Oct 13 17:10:58  kernel: [<c011de22>] printk+0x14/0x18
> Oct 13 17:10:58  kernel: [<c0136851>] __print_symbol+0x9f/0xa8
> Oct 13 17:10:58  kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
> Oct 13 17:10:58  kernel: [<c0117c15>] try_to_wake_up+0x355/0x35f
> Oct 13 17:10:58  kernel: [<c01166f5>] __wake_up_common+0x2f/0x53
> Oct 13 17:10:58  kernel: [<c0116b46>] __wake_up+0x2a/0x3d
> Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
> Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
> Oct 13 17:10:58  kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
> Oct 13 17:10:58  kernel: [<c012c6d8>] __kernel_text_address+0x18/0x23
> Oct 13 17:10:58  kernel: [<c0103b62>] show_trace_log_lvl+0x47/0x6a
> Oct 13 17:10:58  kernel: [<c0103c13>] show_stack_log_lvl+0x8e/0x96
> Oct 13 17:10:58  kernel: [<c0104107>] show_stack+0x20/0x25
> Oct 13 17:10:58  kernel: [<fa1bef79>] lbug_with_loc+0x69/0xc0 [libcfs]
> Oct 13 17:10:58  kernel: [<fa689448>] mds_orphan_add_link+0xcb8/0xd20 [mds]
> Oct 13 17:10:58  kernel: [<fa69c87a>] mds_reint_unlink+0x292a/0x3fd0 [mds]
> Oct 13 17:10:58  kernel: [<fa3ac990>] lustre_swab_ldlm_request+0x0/0x20 
> [ptlrpc]
> Oct 13 17:10:58  kernel: [<fa688495>] mds_reint_rec+0xf5/0x3f0 [mds]
> Oct 13 17:10:58  kernel: [<fa39f788>] ptl_send_buf+0x1b8/0xb00 [ptlrpc]
> Oct 13 17:10:58  kernel: [<fa66bfeb>] mds_reint+0xcb/0x8a0 [mds]
> Oct 13 17:10:58  kernel: [<fa67f998>] mds_handle+0x3048/0xb9df [mds]
> Oct 13 17:10:58  kernel: [<fa4ac402>] LNetMEAttach+0x142/0x4a0 [lnet]
> Oct 13 17:10:58  kernel: [<fa2dcd91>] class_handle_free_cb+0x21/0x190 
> [obdclass]
> Oct 13 17:10:58  kernel: [<c0124d83>] do_gettimeofday+0x31/0xce
> Oct 13 17:10:58  kernel: [<fa2dc06b>] class_handle2object+0xbb/0x2a0 
> [obdclass]
> Oct 13 17:10:58  kernel: [<fa3aca00>] lustre_swab_ptlrpc_body+0x0/0xc0 
> [ptlrpc]
> Oct 13 17:10:58  kernel: [<fa3a9b5a>] lustre_swab_buf+0xfa/0x180 [ptlrpc]
> Oct 13 17:10:58  kernel: [<c0125aac>] lock_timer_base+0x15/0x2f
> Oct 13 17:10:59  kernel: [<c0125bbd>] __mod_timer+0x99/0xa3
> Oct 13 17:10:59  kernel: [<fa3a6efe>] lustre_msg_get_conn_cnt+0xce/0x220 
> [ptlrpc]
> Oct 13 17:10:59  kernel: [<fa3b8e56>] ptlrpc_main+0x2016/0x2f40 [ptlrpc]
> Oct 13 17:10:59  kernel: [<c01b6dc0>] __next_cpu+0x12/0x21
> Oct 13 17:10:59  kernel: [<c012053d>] do_exit+0x711/0x71b
> Oct 13 17:10:59  kernel: [<c0117c1f>] default_wake_function+0x0/0xc
> Oct 13 17:10:59  kernel: [<fa3b6e40>] ptlrpc_main+0x0/0x2f40 [ptlrpc]
> Oct 13 17:10:59  kernel: [<c0101005>] kernel_thread_helper+0x5/0xb
>
> Oct 13 17:12:38  kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog 
> triggered for pid 9132: it was inactive for 100s
>
>
> Cheers,
> Thomas
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss