[Lustre-discuss] Looping in __d_lookup
Robin Humble
rjh+lustre at cita.utoronto.ca
Wed May 21 01:18:39 PDT 2008
On Tue, May 20, 2008 at 10:24:25PM +0200, Jakob Goldbach wrote:
>> Hm, so you actually have a circular loop?
>Yes - I've asked for help on the OpenVZ list as well - Pavel Emelyanov
>provided me with a debug patch. This patch has now confirmed the loop in
>__d_lookup.
we're also seeing __d_lookup soft lockups. 4 are attached.
one was a cached_lookup but the first line of that fn is a __d_lookup
which is where I suspect is where the real soft lockup occurred. after a
while the node is toast and has to be rebooted.
kernel is 2.6.23.17 with patchless lustre 1.6.4.3, modified with
patches from bz 14250 (attachment 14109) and 13378 (attachment 12276).
using o2ib, x86_64, rhel5.1
cheers,
robin
ps. I'm glad to see some Lustre support for modern linux kernels and
would like to see a lot more! the VM's of distro kernel sometimes
behave erratically enough that we can't use them on Lustre client nodes
in production.
-------------- next part --------------
May 6 14:36:53 x6 kernel: BUG: soft lockup - CPU#2 stuck for 11s! [rsync:14200]
May 6 14:36:53 x6 kernel: CPU 2:
May 6 14:36:53 x6 kernel: Modules linked in: ext3 jbd loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod
battery ac sg sd_mod i2c_i801 i2c_core ahci ata_piix libata ib_mthca scsi_mod i5000_edac ib_mad edac_core ib_core ehci_hcd shpchp uhci_hcd rng_core button nfs nfs_acl lockd sunrpc e1000
May 6 14:36:53 x6 kernel: Pid: 14200, comm: rsync Not tainted 2.6.23.17 #1
May 6 14:36:53 x6 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110
May 6 14:36:53 x6 kernel: RSP: 0018:ffff81011b7ebbe8 EFLAGS: 00000286
May 6 14:36:53 x6 kernel: RAX: ffff81011d71aea0 RBX: ffff81011d71aea0 RCX: 0000000000000014
May 6 14:36:53 x6 kernel: RDX: 00000000000cb941 RSI: ffff81011b7ebcb8 RDI: ffff810120522cb8
May 6 14:36:53 x6 kernel: RBP: ffff8101d2707408 R08: 0000000000000007 R09: 0000000000000007
May 6 14:36:53 x6 kernel: R10: 00007fffffffb750 R11: ffffffff802c7d3e R12: ffffffff8853dc60
May 6 14:36:53 x6 kernel: R13: ffff8100992a5470 R14: ffff81024adb3af8 R15: ffff810219defe80
May 6 14:36:53 x6 kernel: FS: 00002aaaab6296e0(0000) GS:ffff81025fc6d840(0000) knlGS:0000000000000000
May 6 14:36:53 x6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 6 14:36:53 x6 kernel: CR2: 000000000079f018 CR3: 000000018c1d0000 CR4: 00000000000006e0
May 6 14:36:53 x6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 6 14:36:54 x6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 6 14:36:54 x6 kernel:
May 6 14:36:54 x6 kernel: Call Trace:
May 6 14:36:54 x6 kernel: [<ffffffff80282cd5>] __d_lookup+0xd3/0x110
May 6 14:36:54 x6 kernel: [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae
May 6 14:36:54 x6 kernel: [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9
May 6 14:36:54 x6 kernel: [<ffffffff8027c16a>] link_path_walk+0x58/0xe0
May 6 14:36:54 x6 kernel: [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf
May 6 14:36:54 x6 kernel: [<ffffffff8027afe4>] getname+0x14c/0x191
May 6 14:36:54 x6 kernel: [<ffffffff8027cd67>] __user_walk_fd+0x37/0x53
May 6 14:36:54 x6 kernel: [<ffffffff80275a7b>] vfs_lstat_fd+0x18/0x47
May 6 14:36:54 x6 kernel: [<ffffffff80275c6d>] sys_newlstat+0x19/0x31
May 6 14:36:54 x6 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83
May 8 09:32:36 x10 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [bonnie++.mpi:31720]
May 8 09:32:36 x10 kernel: CPU 1:
May 8 09:32:36 x10 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery
ac sg sd_mod i5000_edac edac_core ehci_hcd ahci rng_core ata_piix libata i2c_i801 scsi_mod i2c_core ib_mthca ib_mad shpchp uhci_hcd ib_core button nfs nfs_acl lockd sunrpc e1000
May 8 09:32:36 x10 kernel: Pid: 31720, comm: bonnie++.mpi Not tainted 2.6.23.17 #1
May 8 09:32:36 x10 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110
May 8 09:32:36 x10 kernel: RSP: 0018:ffff81014248fd78 EFLAGS: 00000286
May 8 09:32:36 x10 kernel: RAX: ffff8101a79cc500 RBX: ffff8101a79cc500 RCX: 0000000000000014
May 8 09:32:36 x10 kernel: RDX: 00000000000ac16f RSI: ffff81014248feb8 RDI: ffff8101ff617898
May 8 09:32:36 x10 kernel: RBP: ffff8101ff617898 R08: 0000000000000000 R09: ffff81025fd3d5c0
May 8 09:32:37 x10 kernel: R10: 0000000000000001 R11: ffffffff802c7d3e R12: ffffffff8027c1e0
May 8 09:32:37 x10 kernel: R13: ffff81014248fea8 R14: 0000000000000000 R15: ffff81025fd3d5c0
May 8 09:32:37 x10 kernel: FS: 00002aaaac95a990(0000) GS:ffff81025fc6de40(0000) knlGS:0000000000000000
May 8 09:32:37 x10 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 8 09:32:37 x10 kernel: CR2: 00002aaaab1abe20 CR3: 00000001b5525000 CR4: 00000000000006e0
May 8 09:32:37 x10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 8 09:32:37 x10 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 8 09:32:37 x10 kernel:
May 8 09:32:37 x10 kernel: Call Trace:
May 8 09:32:37 x10 kernel: [<ffffffff80279cba>] cached_lookup+0x18/0x86
May 8 09:32:37 x10 kernel: [<ffffffff80279d87>] lookup_hash+0x5f/0xb5
May 8 09:32:37 x10 kernel: [<ffffffff8027d146>] open_namei+0xf4/0x631
May 8 09:32:37 x10 kernel: [<ffffffff802738b1>] file_move+0x1d/0x51
May 8 09:32:37 x10 kernel: [<ffffffff8027148a>] __dentry_open+0x101/0x1ad
May 8 09:32:37 x10 kernel: [<ffffffff802715dc>] do_filp_open+0x1c/0x38
May 8 09:32:37 x10 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126
May 8 09:32:37 x10 kernel: [<ffffffff8027163e>] do_sys_open+0x46/0xca
May 8 09:32:37 x10 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83
May 16 06:10:27 x3 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [python:2684]
May 16 06:10:27 x3 kernel: CPU 1:
May 16 06:10:27 x3 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac rng_core ata_piix ahci edac_core ib_mthca uhci_hcd i2c_i801 libata i2c_core ib_mad scsi_mod ib_core button shpchp ehci_hcd nfs nfs_acl lockd sunrpc e1000
May 16 06:10:28 x3 kernel: Pid: 2684, comm: python Not tainted 2.6.23.17 #1
May 20 12:27:55 x14 kernel: BUG: soft lockup - CPU#0 stuck for 11s! [ps:18064]
May 20 12:27:55 x14 kernel: CPU 0:
May 20 12:27:55 x14 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery
ac sg sd_mod i5000_edac edac_core rng_core ata_piix ehci_hcd ahci libata ib_mthca shpchp uhci_hcd scsi_mod ib_mad i2c_i801 ib_core i2c_core button nfs nfs_acl lockd sunrpc e1000
May 20 12:27:55 x14 kernel: Pid: 18064, comm: ps Not tainted 2.6.23.17 #1
May 20 12:27:55 x14 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110
May 20 12:27:55 x14 kernel: RSP: 0018:ffff81023bea7bb8 EFLAGS: 00000286
May 20 12:27:55 x14 kernel: RAX: ffff81018ce277a0 RBX: ffff81018ce277a0 RCX: 0000000000000014
May 20 12:27:55 x14 kernel: RDX: 000000000008af04 RSI: ffff81023bea7c88 RDI: ffff810202099dd8
May 20 12:27:55 x14 kernel: RBP: ffff8101c18c49c0 R08: 0000000000000007 R09: 0000000000000007
May 20 12:27:55 x14 kernel: R10: ffff81025fef7a18 R11: ffffffff802c7d3e R12: ffffffff88061600
May 20 12:27:55 x14 kernel: R13: ffffffff88061640 R14: ffff8101c18c49c0 R15: ffffffff8045a22d
May 20 12:27:55 x14 kernel: FS: 00002aaaab43cf30(0000) GS:ffffffff804d4000(0000) knlGS:0000000000000000
May 20 12:27:55 x14 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 20 12:27:55 x14 kernel: CR2: 00002aaaab17a548 CR3: 000000012dad3000 CR4: 00000000000006e0
May 20 12:27:56 x14 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 20 12:27:56 x14 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 20 12:27:56 x14 kernel:
May 20 12:27:56 x14 kernel: Call Trace:
May 20 12:27:56 x14 kernel: [<ffffffff80282cd5>] __d_lookup+0xd3/0x110
May 20 12:27:56 x14 kernel: [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae
May 20 12:27:56 x14 kernel: [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9
May 20 12:27:56 x14 kernel: [<ffffffff8027c16a>] link_path_walk+0x58/0xe0
May 20 12:27:56 x14 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126
May 20 12:27:56 x14 kernel: [<ffffffff802d742e>] bitmap_scnprintf+0x8d/0xac
May 20 12:27:56 x14 kernel: [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf
May 20 12:27:56 x14 kernel: [<ffffffff8027cf26>] __path_lookup_intent_open+0x56/0x96
May 20 12:27:56 x14 kernel: [<ffffffff8027d0ca>] open_namei+0x78/0x631
May 20 12:27:56 x14 kernel: [<ffffffff80257561>] __alloc_pages+0x6c/0x2c1
May 20 12:27:56 x14 kernel: [<ffffffff802715dc>] do_filp_open+0x1c/0x38
May 20 12:27:56 x14 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126
May 20 12:27:56 x14 kernel: [<ffffffff8027163e>] do_sys_open+0x46/0xca
May 20 12:27:56 x14 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83
May 20 12:27:56 x14 kernel:
More information about the lustre-discuss
mailing list