[Lustre-discuss] Possible out of memory condition

Craig Tierney Craig.Tierney at noaa.gov
Wed Oct 22 13:37:44 PDT 2008


I just had two nodes hang with the following soft lockup messages.
I am running Centos 5.2 (2.6.18-93.1.13.el5) with the patchless client
(1.6.5.1).  My nodes do not have swap configured on them (no local
disks).   We do have a tool that looks for out of memory condition
and neither of the nodes in question reported a problem (not that it
is perfect).

Does the problem look like an issue with Lustre?

Thanks,
Craig


Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
Oct 22 08:06:45 h53 kernel: CPU 2:
Oct 22 08:06:45 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) 
lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) 
ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) 
mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) 
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) 
ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) 
uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:45 h53 kernel: Pid: 418, comm: kswapd0 Tainted: PF     2.6.18-92.1.13.el5 #1
Oct 22 08:06:45 h53 kernel: RIP: 0010:[<ffffffff80064be4>]  [<ffffffff80064be4>] .text.lock.spinlock+0x2/0x30
Oct 22 08:06:45 h53 kernel: RSP: 0000:ffff81042ea099b8  EFLAGS: 00000286
Oct 22 08:06:45 h53 kernel: RAX: ffff810423adeee8 RBX: ffff81035d272200 RCX: ffff8101c866eb28
Oct 22 08:06:45 h53 kernel: RDX: 0000000000000000 RSI: ffff810282e8eb10 RDI: ffff81035d272388
Oct 22 08:06:45 h53 kernel: RBP: 0000000000000086 R08: ffff81000001b600 R09: ffff81010af7be18
Oct 22 08:06:45 h53 kernel: R10: ffff810107b41a60 R11: ffffffff888117d0 R12: 0000000000000086
Oct 22 08:06:45 h53 kernel: R13: ffff81010d4dce88 R14: ffff81010eb8f0c0 R15: 0000000000000086
Oct 22 08:06:45 h53 kernel: FS:  0000000000000000(0000) GS:ffff81010eba1ec0(0000) knlGS:0000000000000000
Oct 22 08:06:45 h53 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 22 08:06:45 h53 kernel: CR2: 00000000371b1000 CR3: 000000014772f000 CR4: 00000000000006e0
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:45 h53 kernel: Call Trace:
Oct 22 08:06:45 h53 kernel:  [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
Oct 22 08:06:45 h53 kernel:  [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
Oct 22 08:06:45 h53 kernel:  [<ffffffff800d4589>] slab_destroy+0x80/0xa8
Oct 22 08:06:45 h53 kernel:  [<ffffffff800d46cd>] free_block+0x11c/0x13b
Oct 22 08:06:45 h53 kernel:  [<ffffffff8877c8e0>] :lov:lov_teardown_async_page+0x320/0x4e0
Oct 22 08:06:45 h53 kernel:  [<ffffffff801453a1>] radix_tree_delete+0x150/0x187
Oct 22 08:06:45 h53 kernel:  [<ffffffff887f7ba1>] :lustre:__ll_put_llap+0x3e1/0x610
Oct 22 08:06:45 h53 kernel:  [<ffffffff80023601>] __pagevec_free+0x21/0x2e
Oct 22 08:06:45 h53 kernel:  [<ffffffff887f830d>] :lustre:ll_removepage+0x1dd/0x240
Oct 22 08:06:45 h53 kernel:  [<ffffffff800c087c>] __remove_from_page_cache+0x1b/0x3a
Oct 22 08:06:45 h53 kernel:  [<ffffffff888117e0>] :lustre:ll_releasepage+0x10/0x20
Oct 22 08:06:45 h53 kernel:  [<ffffffff800c4fb1>] shrink_inactive_list+0x4e1/0x7f9
Oct 22 08:06:45 h53 kernel:  [<ffffffff800479c3>] __pagevec_release+0x19/0x22
Oct 22 08:06:45 h53 kernel:  [<ffffffff800c49af>] shrink_active_list+0x416/0x426
Oct 22 08:06:45 h53 kernel:  [<ffffffff80012c22>] shrink_zone+0xf6/0x11c
Oct 22 08:06:45 h53 kernel:  [<ffffffff8005792d>] kswapd+0x337/0x45e
Oct 22 08:06:45 h53 kernel:  [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:45 h53 kernel:  [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel:  [<ffffffff800575f6>] kswapd+0x0/0x45e
Oct 22 08:06:45 h53 kernel:  [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel:  [<ffffffff80032625>] kthread+0xfe/0x132
Oct 22 08:06:45 h53 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 22 08:06:45 h53 kernel:  [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel:  [<ffffffff80032527>] kthread+0x0/0x132
Oct 22 08:06:45 h53 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:46 h53 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [convert_emiss.e:27258]
Oct 22 08:06:46 h53 kernel: CPU 4:
Oct 22 08:06:46 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) 
lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) 
ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) 
mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) 
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) 
ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) 
uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:46 h53 kernel: Pid: 27258, comm: convert_emiss.e Tainted: PF     2.6.18-92.1.13.el5 #1
Oct 22 08:06:46 h53 kernel: RIP: 0010:[<ffffffff80064be7>]  [<ffffffff80064be7>] .text.lock.spinlock+0x5/0x30
Oct 22 08:06:46 h53 kernel: RSP: 0018:ffff8102ba4c3770  EFLAGS: 00000286
Oct 22 08:06:46 h53 kernel: RAX: ffff810423adeec0 RBX: ffff8102e617aad0 RCX: 0000000000000000
Oct 22 08:06:46 h53 kernel: RDX: 0000000000000000 RSI: 0000000004df5578 RDI: ffff810423adeec0
Oct 22 08:06:46 h53 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
Oct 22 08:06:46 h53 kernel: R10: ffff810164619cb0 R11: 00000000000000d0 R12: 0000000000000000
Oct 22 08:06:46 h53 kernel: R13: 0000000000000000 R14: ffff810423c36468 R15: ffff81042ebc4000
Oct 22 08:06:46 h53 kernel: FS:  00002b8ce6c66630(0000) GS:ffff81042fc23d40(0000) knlGS:0000000000000000
Oct 22 08:06:46 h53 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 08:06:46 h53 kernel: CR2: 0000000001e46604 CR3: 0000000307228000 CR4: 00000000000006e0
Oct 22 08:06:46 h53 kernel:
Oct 22 08:06:46 h53 kernel: Call Trace:
Oct 22 08:06:46 h53 kernel:  [<ffffffff887066d5>] :osc:osc_queue_async_io+0x7c5/0x1020
Oct 22 08:06:46 h53 kernel:  [<ffffffff8024394c>] udp_push_pending_frames+0x236/0x25b
Oct 22 08:06:46 h53 kernel:  [<ffffffff80064b11>] _spin_lock_bh+0x9/0x14
Oct 22 08:06:46 h53 kernel:  [<ffffffff80030b4b>] release_sock+0x13/0xaa
Oct 22 08:06:46 h53 kernel:  [<ffffffff8005215c>] udp_sendmsg+0x47f/0x566
Oct 22 08:06:46 h53 kernel:  [<ffffffff887fcb44>] :lustre:ll_prepare_write+0x484/0x1360
Oct 22 08:06:46 h53 kernel:  [<ffffffff888117d0>] :lustre:ll_releasepage+0x0/0x20
Oct 22 08:06:46 h53 kernel:  [<ffffffff88628574>] :ptlrpc:ldlm_resource_get+0x1b4/0xa70
Oct 22 08:06:46 h53 kernel:  [<ffffffff8877daf3>] :lov:lov_queue_async_io+0x353/0x450
Oct 22 08:06:46 h53 kernel:  [<ffffffff886263c6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 22 08:06:46 h53 kernel:  [<ffffffff887f3eb0>] :lustre:queue_or_sync_write+0x310/0x1270
Oct 22 08:06:46 h53 kernel:  [<ffffffff8002e341>] __wake_up+0x38/0x4f
Oct 22 08:06:46 h53 kernel:  [<ffffffff887faf3c>] :lustre:ll_commit_write+0x2dc/0x6e0
Oct 22 08:06:46 h53 kernel:  [<ffffffff88797d0f>] :lov:lov_finish_set+0x67f/0x6e0
Oct 22 08:06:46 h53 kernel:  [<ffffffff8879b455>] :lov:lov_set_add_req+0x15/0x20
Oct 22 08:06:46 h53 kernel:  [<ffffffff8000fe3c>] generic_file_buffered_write+0x4dc/0x6d3
Oct 22 08:06:46 h53 kernel:  [<ffffffff887e4fb9>] :lustre:ll_inode_size_unlock+0x89/0xe0
Oct 22 08:06:46 h53 kernel:  [<ffffffff8000df96>] current_fs_time+0x3b/0x40
Oct 22 08:06:46 h53 kernel:  [<ffffffff88798533>] :lov:lov_fini_cancel_set+0x1e3/0x260
Oct 22 08:06:46 h53 kernel:  [<ffffffff80016088>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 22 08:06:46 h53 kernel:  [<ffffffff8863cf60>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
Oct 22 08:06:46 h53 kernel:  [<ffffffff887d9ce0>] :lustre:ll_glimpse_callback+0x0/0x440
Oct 22 08:06:46 h53 kernel:  [<ffffffff800c0727>] __generic_file_write_nolock+0x8f/0xa8
Oct 22 08:06:46 h53 kernel:  [<ffffffff887cf509>] :lustre:ll_extent_unlock+0x419/0x520
Oct 22 08:06:46 h53 kernel:  [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:46 h53 kernel:  [<ffffffff887d054a>] :lustre:ll_file_get_tree_lock_iov+0x44a/0x5e0
Oct 22 08:06:46 h53 kernel:  [<ffffffff8011fb2f>] avc_has_perm+0x43/0x55
Oct 22 08:06:46 h53 kernel:  [<ffffffff80063ae4>] mutex_lock+0xd/0x1d
Oct 22 08:06:46 h53 kernel:  [<ffffffff800c0788>] generic_file_writev+0x48/0xa2
Oct 22 08:06:46 h53 kernel:  [<ffffffff887d129f>] :lustre:ll_file_writev+0xbbf/0xe20
Oct 22 08:06:46 h53 kernel:  [<ffffffff887d151a>] :lustre:ll_file_write+0x1a/0x20
Oct 22 08:06:46 h53 kernel:  [<ffffffff8001649a>] vfs_write+0xce/0x174
Oct 22 08:06:46 h53 kernel:  [<ffffffff80016d67>] sys_write+0x45/0x6e
Oct 22 08:06:46 h53 kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
Oct 22 08:06:46 h53 kernel:



-- 
Craig Tierney (craig.tierney at noaa.gov)



More information about the lustre-discuss mailing list