[Lustre-discuss] CPU stuck errors

Erik Froese erik.froese at gmail.com
Fri Jan 1 16:27:26 PST 2010


I think you're right. I'm re-fscking it right now.
Erik
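[Editor's note: a minimal sketch of the repair pass being discussed. The device path is a placeholder, and ldiskfs targets should be checked with the Lustre-patched e2fsprogs rather than the stock distro e2fsck. The commands are echoed rather than executed so the sketch is safe to run anywhere.]

```shell
# Hypothetical OST block device -- substitute the real one for your site.
OST_DEV=/dev/sdX

# The OST must be unmounted before fsck; these are printed, not run.
echo "umount $OST_DEV"

# -fp: force a full check, fixing only problems considered safe.
echo "e2fsck -fp $OST_DEV"

# If -fp still reports errors, a second pass with -fy answers yes to every
# repair prompt (take a backup or dd image of the device first if possible).
echo "e2fsck -fy $OST_DEV"
```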

On Thu, Dec 31, 2009 at 1:29 PM, Aaron Knister <aaron.knister at gmail.com> wrote:

> My best guess is that there's something still not right on-disk. This is my
> $.02, but I'd suggest running another fsck on that ost if you haven't
> already.
>
> On Dec 31, 2009, at 12:20 PM, Erik Froese wrote:
>
> > Yesterday we had an OST fail. I had to e2fsck it to bring it back
> (there was some corruption).
> >
> > All servers are:
> > RH 5.3 with Lustre 1.8.1.1
> > Mellanox QDR IB
> > Fiber connected storage.
> >
> > I left the OST mounted yesterday but it was deactivated on the mds.
> > [root at mds-0-0 osc]# cat /proc/fs/lustre/osc/scratch-OST000e-osc/active
> > 0
> >
> > At some point this morning the OSS locked up and had to be rebooted. I
> couldn't even access it on the console.
> >
> > I see I/O errors when trying to copy files that exist on that OST (lfs
> find -O scratch-OST000e_UUID /scratch)
> >
> > Now we're seeing CPU lockups on the OSS
> >
> > Soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
> > Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_ost_500:12066]
> > Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s!
> [ll_ost_26:10467]
> > Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_ost_500:12066]
> > Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s!
> [ll_ost_26:10467]
> > Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_ost_500:12066]
> > Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s!
> [ll_ost_26:10467]
> > Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_ost_500:12066]
> >
> > dmesg:
> >
> > BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> > CPU 2:
> > Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) mgc(U) ldiskfs(U)
> crc16(U) ost(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U)
> obdclass(U) lvfs(U) ksocklnd(U) ko2iblnd(
> > U) lnet(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U)
> ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U)
> xt_state(U) ip_conntrack(U) nfnetlink(U) iptabl
> > e_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U)
> ip6_tables(U) x_tables(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U)
> ib_addr(U) ib_ipoib(U) ipoib_helper(
> > U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
> ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U)
> dm_multipath(U) scsi_dh(U) video(U) hw
> > mon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U)
> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U)
> cdrom(U) mlx4_core(U) pcspkr(
> > U) igb(U) i2c_i801(U) i2c_core(U) dm_raid45(U) dm_message(U)
> dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U)
> usb_storage(U) qla2xxx(U) scsi_transport_fc(U) ahc
> > i(U) libata(U) shpchp(U) aacraid(U) mppUpper(U) sg(U) sd_mod(U)
> scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > Pid: 12066, comm: ll_ost_500 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> > RIP: 0010:[<ffffffff88b3bb64>]  [<ffffffff88b3bb64>]
> :ldiskfs:ldiskfs_find_entry+0x244/0x5c0
> > RSP: 0018:ffff8106656e9610  EFLAGS: 00000202
> > RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000004
> > RDX: ffff81031a0f3000 RSI: ffff81033a504b4f RDI: ffff8103551591fb
> > RBP: 0000000000000002 R08: ffff810355159ff8 R09: ffff810355159000
> > R10: ffff810001014f00 R11: 000000004b3cd2fd R12: ffff81033a5d7550
> > R13: ffffffff80063b4c R14: ffff8106656e96c8 R15: ffffffff80014fae
> > FS:  00002aeb511bd220(0000) GS:ffff81010c499240(0000)
> knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00000039f5499a50 CR3: 0000000000201000 CR4: 00000000000006e0
> >
> > Call Trace:
> >  [<ffffffff80089bdd>] dequeue_task+0x18/0x37
> >  [<ffffffff80063098>] thread_return+0x62/0xfe
> >  [<ffffffff88b3df63>] :ldiskfs:ldiskfs_lookup+0x53/0x290
> >  [<ffffffff800366e8>] __lookup_hash+0x10b/0x130
> >  [<ffffffff80063d2a>] .text.lock.mutex+0x5/0x14
> >  [<ffffffff800e2c9b>] lookup_one_len+0x53/0x61
> >  [<ffffffff88ba61ed>] :obdfilter:filter_fid2dentry+0x42d/0x730
> >  [<ffffffff8026f08b>] __down_trylock+0x44/0x4e
> >  [<ffffffff800647c0>] __down_failed_trylock+0x35/0x3a
> >  [<ffffffff88bbff9b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
> >  [<ffffffff888da2e6>] :ptlrpc:ldlm_resource_get+0x8f6/0xa50
> >  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
> >  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
> >  [<ffffffff888d0eba>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
> >  [<ffffffff889158d1>] :ptlrpc:lustre_swab_buf+0x81/0x170
> >  [<ffffffff888e1adb>] :ptlrpc:target_queue_recovery_request+0x9b/0x1690
> >  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
> >  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
> >  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
> >  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
> >  [<ffffffff888f75a0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
> >  [<ffffffff889145d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
> >  [<ffffffff88b107b3>] :ost:ost_handle+0x54b3/0x5a70
> >  [<ffffffff800d73cd>] free_block+0x126/0x143
> >  [<ffffffff88758305>] :lnet:lnet_match_blocked_msg+0x375/0x390
> >  [<ffffffff88919f05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
> >  [<ffffffff80089d8d>] enqueue_task+0x41/0x56
> >  [<ffffffff8891ec1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
> >  [<ffffffff88921357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
> >  [<ffffffff80063098>] thread_return+0x62/0xfe
> >  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
> >  [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
> >  [<ffffffff88924e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
> >  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
> >  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
> >  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> >  [<ffffffff88923bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
> >  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > On the MDS I see this in dmesg:
> > BUG: soft lockup - CPU#1 stuck for 10s! [ll_evictor:11659]
> > CPU 1:
> > Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)
> lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U)
> nfs(U) lockd(U) fscache(U) nfs_acl(U
> > ) ksocklnd(U) ko2iblnd(U) lnet(U) libcfs(U) raid10(U) crc16(U) raid1(U)
> autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U)
> ip_conntrack_netbios_ns(U) ipt_REJECT
> > (U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U)
> ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U)
> x_tables(U) cpufreq_ondemand(U) acpi_cp
> > ufreq(U) freq_table(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U)
> ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U)
> xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_um
> > ad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U)
> dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U)
> button(U) battery(U) asus_acpi(U) acpi
> > _memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U)
> cdrom(U) mlx4_core(U) igb(U) i2c_i801(U) i2c_core(U) pcspkr(U) dm_raid45(U)
> dm_message(U) dm_region_hash
> > (U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) ahci(U)
> libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mptbase(U) shpchp(U)
> aacraid(U) mppUpper(U) sg(U
> > ) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U)
> ehci_hcd(U)
> > Pid: 11659, comm: ll_evictor Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> > RIP: 0010:[<ffffffff80064aee>]  [<ffffffff80064aee>] _write_lock+0x7/0xf
> > RSP: 0018:ffff81034fcefc78  EFLAGS: 00000246
> > RAX: 000000000000ffff RBX: 000000000000a09f RCX: 00000000000d0117
> > RDX: 0000000000000195 RSI: ffffffff802fae80 RDI: ffffc20011f029fc
> > RBP: 0000000000000286 R08: ffff81000100e8e0 R09: 0000000000000000
> > R10: ffff8106070f5200 R11: 0000000000000150 R12: 0000000000000286
> > R13: ffff81034fcefc20 R14: ffff8106070f5258 R15: ffff8106070f5200
> > FS:  00002af849256220(0000) GS:ffff81010c4994c0(0000)
> knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 000000386ba41900 CR3: 0000000000201000 CR4: 00000000000006e0
> >
> > Call Trace:
> >  [<ffffffff88830ba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
> >  [<ffffffff88837ae8>] :obdclass:class_disconnect+0x398/0x420
> >  [<ffffffff88bc75e1>] :mds:mds_disconnect+0x121/0xe40
> >  [<ffffffff8014b87a>] snprintf+0x44/0x4c
> >  [<ffffffff88833994>] :obdclass:class_fail_export+0x384/0x4c0
> >  [<ffffffff88904238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
> >  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
> >  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
> >  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> >  [<ffffffff88903d40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
> >  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > Can anyone shed some light on this?
> >
> > Thanks
> > Erik
> >
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
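[Editor's note: a sketch of the OST-deactivation workflow referenced in the thread, using the filesystem name "scratch" and OST index 000e that appear above. `N` is a placeholder for the device number reported by `lctl dl`. The commands are printed rather than executed; the real ones run on the MDS (deactivate) and on a client (`lfs find`).]

```shell
# Target names taken from the thread above.
TARGET=scratch-OST000e

# 1. On the MDS, find the device number of the OSC for the failed OST:
echo "lctl dl | grep ${TARGET}-osc"

# 2. Deactivate it so no new objects are allocated there
#    (N is the device number from step 1):
echo "lctl --device N deactivate"

# 3. Verify, as Erik did (0 means deactivated):
echo "cat /proc/fs/lustre/osc/${TARGET}-osc/active"

# 4. On a client, locate files with objects striped onto that OST so they
#    can be copied off or restored from backup:
echo "lfs find -O ${TARGET}_UUID /scratch"
```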