[Lustre-discuss] CPU stuck errors

Erik Froese erik.froese at gmail.com
Sat Jan 2 09:33:55 PST 2010


After the e2fsck I was able to mount the OST cleanly and the CPU lockups went
away.
Thanks!
Erik

On Fri, Jan 1, 2010 at 7:27 PM, Erik Froese <erik.froese at gmail.com> wrote:

> I think you're right. I'm re-running fsck on it right now.
> Erik
>
>
> On Thu, Dec 31, 2009 at 1:29 PM, Aaron Knister <aaron.knister at gmail.com> wrote:
>
>> My best guess is that there's something still not right on-disk. This is
>> my $.02, but I'd suggest running another fsck on that ost if you haven't
>> already.
>>
>> On Dec 31, 2009, at 12:20 PM, Erik Froese wrote:
>>
>> > Yesterday we had an OST fail. I had to e2fsck it to fix it (although
>> > there was some corruption).
>> >
>> > All servers are:
>> > RH 5.3 with Lustre 1.8.1.1
>> > Mellanox QDR IB
>> > Fiber connected storage.
>> >
>> > I left the OST mounted yesterday but it was deactivated on the mds.
>> > [root at mds-0-0 osc]# cat /proc/fs/lustre/osc/scratch-OST000e-osc/active
>> > 0
>> >
>> > At some point this morning the OSS locked up and had to be rebooted. I
>> > couldn't even access it on the console.
>> >
>> > I see I/O errors when trying to copy files that exist on that OST (lfs
>> > find -O scratch-OST000e_UUID /scratch)
>> >
>> > Now we're seeing CPU lockups on the OSS
>> >
>> > Soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
>> > Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
>> > Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
>> > Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
>> > Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
>> > Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
>> > Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
>> > Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
>> >
>> > dmesg:
>> >
>> > BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
>> > CPU 2:
>> > Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) mgc(U) ldiskfs(U)
>> > crc16(U) ost(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U)
>> > obdclass(U) lvfs(U) ksocklnd(U) ko2iblnd(U) lnet(U) libcfs(U) autofs4(U)
>> > ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U)
>> > ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U)
>> > nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U)
>> > ip6table_filter(U) ip6_tables(U) x_tables(U) rdma_ucm(U) ib_sdp(U)
>> > rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U)
>> > ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
>> > mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_multipath(U)
>> > scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U)
>> > battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U)
>> > parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) pcspkr(U) igb(U)
>> > i2c_i801(U) i2c_core(U) dm_raid45(U) dm_message(U) dm_region_hash(U)
>> > dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) qla2xxx(U)
>> > scsi_transport_fc(U) ahci(U) libata(U) shpchp(U) aacraid(U) mppUpper(U)
>> > sg(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U)
>> > ehci_hcd(U)
>> > Pid: 12066, comm: ll_ost_500 Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
>> > RIP: 0010:[<ffffffff88b3bb64>]  [<ffffffff88b3bb64>] :ldiskfs:ldiskfs_find_entry+0x244/0x5c0
>> > RSP: 0018:ffff8106656e9610  EFLAGS: 00000202
>> > RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000004
>> > RDX: ffff81031a0f3000 RSI: ffff81033a504b4f RDI: ffff8103551591fb
>> > RBP: 0000000000000002 R08: ffff810355159ff8 R09: ffff810355159000
>> > R10: ffff810001014f00 R11: 000000004b3cd2fd R12: ffff81033a5d7550
>> > R13: ffffffff80063b4c R14: ffff8106656e96c8 R15: ffffffff80014fae
>> > FS:  00002aeb511bd220(0000) GS:ffff81010c499240(0000) knlGS:0000000000000000
>> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > CR2: 00000039f5499a50 CR3: 0000000000201000 CR4: 00000000000006e0
>> >
>> > Call Trace:
>> >  [<ffffffff80089bdd>] dequeue_task+0x18/0x37
>> >  [<ffffffff80063098>] thread_return+0x62/0xfe
>> >  [<ffffffff88b3df63>] :ldiskfs:ldiskfs_lookup+0x53/0x290
>> >  [<ffffffff800366e8>] __lookup_hash+0x10b/0x130
>> >  [<ffffffff80063d2a>] .text.lock.mutex+0x5/0x14
>> >  [<ffffffff800e2c9b>] lookup_one_len+0x53/0x61
>> >  [<ffffffff88ba61ed>] :obdfilter:filter_fid2dentry+0x42d/0x730
>> >  [<ffffffff8026f08b>] __down_trylock+0x44/0x4e
>> >  [<ffffffff800647c0>] __down_failed_trylock+0x35/0x3a
>> >  [<ffffffff88bbff9b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
>> >  [<ffffffff888da2e6>] :ptlrpc:ldlm_resource_get+0x8f6/0xa50
>> >  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
>> >  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
>> >  [<ffffffff888d0eba>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
>> >  [<ffffffff889158d1>] :ptlrpc:lustre_swab_buf+0x81/0x170
>> >  [<ffffffff888e1adb>] :ptlrpc:target_queue_recovery_request+0x9b/0x1690
>> >  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
>> >  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
>> >  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
>> >  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
>> >  [<ffffffff888f75a0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
>> >  [<ffffffff889145d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
>> >  [<ffffffff88b107b3>] :ost:ost_handle+0x54b3/0x5a70
>> >  [<ffffffff800d73cd>] free_block+0x126/0x143
>> >  [<ffffffff88758305>] :lnet:lnet_match_blocked_msg+0x375/0x390
>> >  [<ffffffff88919f05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
>> >  [<ffffffff80089d8d>] enqueue_task+0x41/0x56
>> >  [<ffffffff8891ec1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
>> >  [<ffffffff88921357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
>> >  [<ffffffff80063098>] thread_return+0x62/0xfe
>> >  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
>> >  [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
>> >  [<ffffffff88924e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
>> >  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
>> >  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
>> >  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> >  [<ffffffff88923bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
>> >  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> >
>> > On the MDS I see this in dmesg:
>> > BUG: soft lockup - CPU#1 stuck for 10s! [ll_evictor:11659]
>> > CPU 1:
>> > Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)
>> > lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U)
>> > nfs(U) lockd(U) fscache(U) nfs_acl(U) ksocklnd(U) ko2iblnd(U) lnet(U)
>> > libcfs(U) raid10(U) crc16(U) raid1(U) autofs4(U) ipmi_devintf(U)
>> > ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U)
>> > ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U)
>> > ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U)
>> > x_tables(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) rdma_ucm(U)
>> > ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U)
>> > ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
>> > ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U)
>> > dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U)
>> > button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U)
>> > lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) igb(U)
>> > i2c_i801(U) i2c_core(U) pcspkr(U) dm_raid45(U) dm_message(U)
>> > dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U)
>> > usb_storage(U) ahci(U) libata(U) mptsas(U) mptscsih(U)
>> > scsi_transport_sas(U) mptbase(U) shpchp(U) aacraid(U) mppUpper(U) sg(U)
>> > sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
>> > Pid: 11659, comm: ll_evictor Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
>> > RIP: 0010:[<ffffffff80064aee>]  [<ffffffff80064aee>] _write_lock+0x7/0xf
>> > RSP: 0018:ffff81034fcefc78  EFLAGS: 00000246
>> > RAX: 000000000000ffff RBX: 000000000000a09f RCX: 00000000000d0117
>> > RDX: 0000000000000195 RSI: ffffffff802fae80 RDI: ffffc20011f029fc
>> > RBP: 0000000000000286 R08: ffff81000100e8e0 R09: 0000000000000000
>> > R10: ffff8106070f5200 R11: 0000000000000150 R12: 0000000000000286
>> > R13: ffff81034fcefc20 R14: ffff8106070f5258 R15: ffff8106070f5200
>> > FS:  00002af849256220(0000) GS:ffff81010c4994c0(0000) knlGS:0000000000000000
>> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > CR2: 000000386ba41900 CR3: 0000000000201000 CR4: 00000000000006e0
>> >
>> > Call Trace:
>> >  [<ffffffff88830ba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
>> >  [<ffffffff88837ae8>] :obdclass:class_disconnect+0x398/0x420
>> >  [<ffffffff88bc75e1>] :mds:mds_disconnect+0x121/0xe40
>> >  [<ffffffff8014b87a>] snprintf+0x44/0x4c
>> >  [<ffffffff88833994>] :obdclass:class_fail_export+0x384/0x4c0
>> >  [<ffffffff88904238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
>> >  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
>> >  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
>> >  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> >  [<ffffffff88903d40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
>> >  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> >
>> > Can anyone shed some light on this?
>> >
>> > Thanks
>> > Erik
>> >
>> >
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > Lustre-discuss at lists.lustre.org
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>
>
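
For anyone who wants to check whether lockups like the ones quoted in this
thread have really stopped after an fsck, here is a minimal Python sketch
that tallies "soft lockup" messages per CPU and thread. It is not from the
original posts. The names LOCKUP_RE, unquote and summarize_lockups are made
up for illustration, and it assumes the messages have the same shape as the
ones quoted above. Since mail clients often wrap long log lines and add
quote markers, it strips the markers and joins lines before matching.

```python
import re
from collections import Counter

# Matches kernel soft-lockup reports of the shape seen in this thread, e.g.
#   BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s*"
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]",
    re.IGNORECASE,
)


def unquote(line):
    """Strip leading mail quote markers such as '>' or '>> >' from a line."""
    return re.sub(r"^(?:>+\s?)+", "", line).strip()


def summarize_lockups(lines):
    """Count soft-lockup reports per (cpu, thread name, pid)."""
    # A wrapped report may have its "[thread:pid]" part on the next line,
    # so join all unquoted lines with single spaces before searching.
    text = " ".join(unquote(line) for line in lines)
    counts = Counter()
    for m in LOCKUP_RE.finditer(text):
        counts[(int(m["cpu"]), m["comm"], int(m["pid"]))] += 1
    return counts


# A few lines copied from the quoted log, including mail wrapping.
sample = """\
>> > Soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
>> > Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
>> [ll_ost_500:12066]
>> > Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s!
>> [ll_ost_26:10467]
""".splitlines()

if __name__ == "__main__":
    for (cpu, comm, pid), n in sorted(summarize_lockups(sample).items()):
        print(f"CPU#{cpu} {comm} (pid {pid}): {n} report(s)")
```

Feeding it the contents of a syslog file instead of the sample (for example
open(path).read().splitlines(), with whatever path your system uses) should
show whether the same ll_ost threads keep reappearing after the repair.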