[Lustre-discuss] Looping in __d_lookup

Robin Humble rjh+lustre at cita.utoronto.ca
Wed May 21 01:18:39 PDT 2008


On Tue, May 20, 2008 at 10:24:25PM +0200, Jakob Goldbach wrote:
>> Hm, so you actually have a circular loop? 
>Yes - I've asked for help on the OpenVZ list as well - Pavel Emelyanov
>provided me with a debug patch. This patch has now confirmed the loop in
>__d_lookup.

we're also seeing __d_lookup soft lockups. 4 are attached.

one was a cached_lookup but the first line of that fn is a __d_lookup
which is where I suspect is where the real soft lockup occurred. after a
while the node is toast and has to be rebooted.

kernel is 2.6.23.17 with patchless lustre 1.6.4.3, modified with
patches from bz 14250 (attachment 14109) and 13378 (attachment 12276).
using o2ib, x86_64, rhel5.1

cheers,
robin

ps. I'm glad to see some Lustre support for modern linux kernels and
would like to see a lot more! the VM's of distro kernel sometimes
behave erratically enough that we can't use them on Lustre client nodes
in production.
-------------- next part --------------
May  6 14:36:53 x6 kernel: BUG: soft lockup - CPU#2 stuck for 11s! [rsync:14200] 
May  6 14:36:53 x6 kernel: CPU 2: 
May  6 14:36:53 x6 kernel: Modules linked in: ext3 jbd loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs  rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod
battery ac sg sd_mod i2c_i801 i2c_core ahci ata_piix libata ib_mthca scsi_mod i5000_edac ib_mad edac_core ib_core ehci_hcd  shpchp uhci_hcd rng_core button nfs nfs_acl lockd sunrpc e1000
May  6 14:36:53 x6 kernel: Pid: 14200, comm: rsync Not tainted 2.6.23.17 #1 
May  6 14:36:53 x6 kernel: RIP: 0010:[<ffffffff80282cef>]  [<ffffffff80282cef>] __d_lookup+0xed/0x110 
May  6 14:36:53 x6 kernel: RSP: 0018:ffff81011b7ebbe8  EFLAGS: 00000286 
May  6 14:36:53 x6 kernel: RAX: ffff81011d71aea0 RBX: ffff81011d71aea0 RCX: 0000000000000014 
May  6 14:36:53 x6 kernel: RDX: 00000000000cb941 RSI: ffff81011b7ebcb8 RDI: ffff810120522cb8 
May  6 14:36:53 x6 kernel: RBP: ffff8101d2707408 R08: 0000000000000007 R09: 0000000000000007 
May  6 14:36:53 x6 kernel: R10: 00007fffffffb750 R11: ffffffff802c7d3e R12: ffffffff8853dc60 
May  6 14:36:53 x6 kernel: R13: ffff8100992a5470 R14: ffff81024adb3af8 R15: ffff810219defe80 
May  6 14:36:53 x6 kernel: FS:  00002aaaab6296e0(0000) GS:ffff81025fc6d840(0000) knlGS:0000000000000000 
May  6 14:36:53 x6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
May  6 14:36:53 x6 kernel: CR2: 000000000079f018 CR3: 000000018c1d0000 CR4: 00000000000006e0 
May  6 14:36:53 x6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
May  6 14:36:54 x6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
May  6 14:36:54 x6 kernel:  
May  6 14:36:54 x6 kernel: Call Trace: 
May  6 14:36:54 x6 kernel:  [<ffffffff80282cd5>] __d_lookup+0xd3/0x110 
May  6 14:36:54 x6 kernel:  [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae 
May  6 14:36:54 x6 kernel:  [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9 
May  6 14:36:54 x6 kernel:  [<ffffffff8027c16a>] link_path_walk+0x58/0xe0 
May  6 14:36:54 x6 kernel:  [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf 
May  6 14:36:54 x6 kernel:  [<ffffffff8027afe4>] getname+0x14c/0x191 
May  6 14:36:54 x6 kernel:  [<ffffffff8027cd67>] __user_walk_fd+0x37/0x53 
May  6 14:36:54 x6 kernel:  [<ffffffff80275a7b>] vfs_lstat_fd+0x18/0x47 
May  6 14:36:54 x6 kernel:  [<ffffffff80275c6d>] sys_newlstat+0x19/0x31 
May  6 14:36:54 x6 kernel:  [<ffffffff8020b3ae>] system_call+0x7e/0x83

May  8 09:32:36 x10 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [bonnie++.mpi:31720] 
May  8 09:32:36 x10 kernel: CPU 1: 
May  8 09:32:36 x10 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery
ac sg sd_mod i5000_edac edac_core ehci_hcd ahci rng_core ata_piix libata i2c_i801 scsi_mod i2c_core ib_mthca ib_mad shpchp uhci_hcd ib_core button nfs nfs_acl lockd sunrpc e1000 
May  8 09:32:36 x10 kernel: Pid: 31720, comm: bonnie++.mpi Not tainted 2.6.23.17 #1 
May  8 09:32:36 x10 kernel: RIP: 0010:[<ffffffff80282cef>]  [<ffffffff80282cef>] __d_lookup+0xed/0x110 
May  8 09:32:36 x10 kernel: RSP: 0018:ffff81014248fd78  EFLAGS: 00000286 
May  8 09:32:36 x10 kernel: RAX: ffff8101a79cc500 RBX: ffff8101a79cc500 RCX: 0000000000000014 
May  8 09:32:36 x10 kernel: RDX: 00000000000ac16f RSI: ffff81014248feb8 RDI: ffff8101ff617898 
May  8 09:32:36 x10 kernel: RBP: ffff8101ff617898 R08: 0000000000000000 R09: ffff81025fd3d5c0 
May  8 09:32:37 x10 kernel: R10: 0000000000000001 R11: ffffffff802c7d3e R12: ffffffff8027c1e0 
May  8 09:32:37 x10 kernel: R13: ffff81014248fea8 R14: 0000000000000000 R15: ffff81025fd3d5c0 
May  8 09:32:37 x10 kernel: FS:  00002aaaac95a990(0000) GS:ffff81025fc6de40(0000) knlGS:0000000000000000 
May  8 09:32:37 x10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
May  8 09:32:37 x10 kernel: CR2: 00002aaaab1abe20 CR3: 00000001b5525000 CR4: 00000000000006e0 
May  8 09:32:37 x10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
May  8 09:32:37 x10 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
May  8 09:32:37 x10 kernel:  
May  8 09:32:37 x10 kernel: Call Trace: 
May  8 09:32:37 x10 kernel:  [<ffffffff80279cba>] cached_lookup+0x18/0x86 
May  8 09:32:37 x10 kernel:  [<ffffffff80279d87>] lookup_hash+0x5f/0xb5 
May  8 09:32:37 x10 kernel:  [<ffffffff8027d146>] open_namei+0xf4/0x631 
May  8 09:32:37 x10 kernel:  [<ffffffff802738b1>] file_move+0x1d/0x51 
May  8 09:32:37 x10 kernel:  [<ffffffff8027148a>] __dentry_open+0x101/0x1ad 
May  8 09:32:37 x10 kernel:  [<ffffffff802715dc>] do_filp_open+0x1c/0x38 
May  8 09:32:37 x10 kernel:  [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 
May  8 09:32:37 x10 kernel:  [<ffffffff8027163e>] do_sys_open+0x46/0xca 
May  8 09:32:37 x10 kernel:  [<ffffffff8020b3ae>] system_call+0x7e/0x83

May 16 06:10:27 x3 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [python:2684] 
May 16 06:10:27 x3 kernel: CPU 1: 
May 16 06:10:27 x3 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac rng_core ata_piix ahci edac_core ib_mthca uhci_hcd i2c_i801 libata i2c_core ib_mad scsi_mod ib_core button shpchp ehci_hcd nfs nfs_acl lockd sunrpc e1000 
May 16 06:10:28 x3 kernel: Pid: 2684, comm: python Not tainted 2.6.23.17 #1

May 20 12:27:55 x14 kernel: BUG: soft lockup - CPU#0 stuck for 11s! [ps:18064] 
May 20 12:27:55 x14 kernel: CPU 0: 
May 20 12:27:55 x14 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm  ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery
ac sg sd_mod i5000_edac edac_core rng_core ata_piix ehci_hcd ahci libata ib_mthca shpchp uhci_hcd scsi_mod ib_mad i2c_i801 ib_core i2c_core button nfs nfs_acl lockd sunrpc e1000 
May 20 12:27:55 x14 kernel: Pid: 18064, comm: ps Not tainted 2.6.23.17 #1 
May 20 12:27:55 x14 kernel: RIP: 0010:[<ffffffff80282cef>]  [<ffffffff80282cef>] __d_lookup+0xed/0x110 
May 20 12:27:55 x14 kernel: RSP: 0018:ffff81023bea7bb8  EFLAGS: 00000286 
May 20 12:27:55 x14 kernel: RAX: ffff81018ce277a0 RBX: ffff81018ce277a0 RCX: 0000000000000014 
May 20 12:27:55 x14 kernel: RDX: 000000000008af04 RSI: ffff81023bea7c88 RDI: ffff810202099dd8 
May 20 12:27:55 x14 kernel: RBP: ffff8101c18c49c0 R08: 0000000000000007 R09: 0000000000000007 
May 20 12:27:55 x14 kernel: R10: ffff81025fef7a18 R11: ffffffff802c7d3e R12: ffffffff88061600 
May 20 12:27:55 x14 kernel: R13: ffffffff88061640 R14: ffff8101c18c49c0 R15: ffffffff8045a22d 
May 20 12:27:55 x14 kernel: FS:  00002aaaab43cf30(0000) GS:ffffffff804d4000(0000) knlGS:0000000000000000 
May 20 12:27:55 x14 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b 
May 20 12:27:55 x14 kernel: CR2: 00002aaaab17a548 CR3: 000000012dad3000 CR4: 00000000000006e0 
May 20 12:27:56 x14 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
May 20 12:27:56 x14 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
May 20 12:27:56 x14 kernel:  
May 20 12:27:56 x14 kernel: Call Trace: 
May 20 12:27:56 x14 kernel:  [<ffffffff80282cd5>] __d_lookup+0xd3/0x110 
May 20 12:27:56 x14 kernel:  [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae 
May 20 12:27:56 x14 kernel:  [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9 
May 20 12:27:56 x14 kernel:  [<ffffffff8027c16a>] link_path_walk+0x58/0xe0 
May 20 12:27:56 x14 kernel:  [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 
May 20 12:27:56 x14 kernel:  [<ffffffff802d742e>] bitmap_scnprintf+0x8d/0xac 
May 20 12:27:56 x14 kernel:  [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf 
May 20 12:27:56 x14 kernel:  [<ffffffff8027cf26>] __path_lookup_intent_open+0x56/0x96 
May 20 12:27:56 x14 kernel:  [<ffffffff8027d0ca>] open_namei+0x78/0x631 
May 20 12:27:56 x14 kernel:  [<ffffffff80257561>] __alloc_pages+0x6c/0x2c1 
May 20 12:27:56 x14 kernel:  [<ffffffff802715dc>] do_filp_open+0x1c/0x38 
May 20 12:27:56 x14 kernel:  [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 
May 20 12:27:56 x14 kernel:  [<ffffffff8027163e>] do_sys_open+0x46/0xca 
May 20 12:27:56 x14 kernel:  [<ffffffff8020b3ae>] system_call+0x7e/0x83 
May 20 12:27:56 x14 kernel:


More information about the lustre-discuss mailing list