[Lustre-discuss] CPU stuck errors

Aaron Knister aaron.knister at gmail.com
Thu Dec 31 10:29:33 PST 2009


My best guess is that something is still not right on disk. This is just my $.02, but I'd suggest running another fsck on that OST if you haven't already.
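
In case it helps, here is a rough sketch of how a read-only check could be wrapped so that it refuses to run while the OST is still mounted. The function names is_mounted and check_ost, and the device path in the usage line, are placeholders of mine and not anything from your setup. -f forces the check, and -n opens the filesystem read-only and answers "no" to every prompt, so nothing on disk is modified. On Lustre servers the e2fsprogs build distributed for Lustre is normally the one to use.

```shell
# Sketch only: function names and device paths are placeholders.

# Return 0 if device $1 appears in the mounts table $2 (default /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Run a forced, read-only e2fsck on device $1, but only if it is not mounted.
# $2 optionally names an alternate mounts table (useful for testing).
check_ost() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "$1 is still mounted; unmount the OST before checking" >&2
        return 1
    fi
    e2fsck -fn "$1"
}
```

Usage would be something like "check_ost /dev/mapper/ost000e" (a made-up path). The mount test only matches the device name exactly as it appears in the mounts table, so a multipath alias or symlink could slip past it; treat it as a convenience, not a guarantee.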

On Dec 31, 2009, at 12:20 PM, Erik Froese wrote:

> Yesterday we had an OST fail. I had to run e2fsck on it to fix it (there was some corruption).
> 
> All servers are:
> RH 5.3 with Lustre 1.8.1.1 
> Mellanox QDR IB
> Fiber connected storage.
> 
> I left the OST mounted yesterday but it was deactivated on the mds.
> [root@mds-0-0 osc]# cat /proc/fs/lustre/osc/scratch-OST000e-osc/active
> 0
> 
> At some point this morning the OSS locked up and had to be rebooted. I couldn't even access it on the console.
> 
> I see I/O errors when trying to copy files that exist on that OST (lfs find -O scratch-OST000e_UUID /scratch)
> 
> Now we're seeing CPU lockups on the OSS
> 
> Soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
> Dec 31 12:12:12 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
> Dec 31 12:12:22 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
> Dec 31 12:12:32 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_26:10467]
> Dec 31 12:12:42 oss-0-0 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> 
> dmesg:
> 
> BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_500:12066]
> CPU 2:
> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) mgc(U) ldiskfs(U) crc16(U) ost(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) ko2iblnd(U)
> lnet(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U)
> ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U)
> ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U)
> backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) pcspkr(U)
> igb(U) i2c_i801(U) i2c_core(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) ahci(U)
> libata(U) shpchp(U) aacraid(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> Pid: 12066, comm: ll_ost_500 Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> RIP: 0010:[<ffffffff88b3bb64>]  [<ffffffff88b3bb64>] :ldiskfs:ldiskfs_find_entry+0x244/0x5c0
> RSP: 0018:ffff8106656e9610  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000004
> RDX: ffff81031a0f3000 RSI: ffff81033a504b4f RDI: ffff8103551591fb
> RBP: 0000000000000002 R08: ffff810355159ff8 R09: ffff810355159000
> R10: ffff810001014f00 R11: 000000004b3cd2fd R12: ffff81033a5d7550
> R13: ffffffff80063b4c R14: ffff8106656e96c8 R15: ffffffff80014fae
> FS:  00002aeb511bd220(0000) GS:ffff81010c499240(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000039f5499a50 CR3: 0000000000201000 CR4: 00000000000006e0
> 
> Call Trace:
>  [<ffffffff80089bdd>] dequeue_task+0x18/0x37
>  [<ffffffff80063098>] thread_return+0x62/0xfe
>  [<ffffffff88b3df63>] :ldiskfs:ldiskfs_lookup+0x53/0x290
>  [<ffffffff800366e8>] __lookup_hash+0x10b/0x130
>  [<ffffffff80063d2a>] .text.lock.mutex+0x5/0x14
>  [<ffffffff800e2c9b>] lookup_one_len+0x53/0x61
>  [<ffffffff88ba61ed>] :obdfilter:filter_fid2dentry+0x42d/0x730
>  [<ffffffff8026f08b>] __down_trylock+0x44/0x4e
>  [<ffffffff800647c0>] __down_failed_trylock+0x35/0x3a
>  [<ffffffff88bbff9b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
>  [<ffffffff888da2e6>] :ptlrpc:ldlm_resource_get+0x8f6/0xa50
>  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
>  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
>  [<ffffffff888d0eba>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
>  [<ffffffff889158d1>] :ptlrpc:lustre_swab_buf+0x81/0x170
>  [<ffffffff888e1adb>] :ptlrpc:target_queue_recovery_request+0x9b/0x1690
>  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
>  [<ffffffff888f3f30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
>  [<ffffffff888f9470>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
>  [<ffffffff88b084a0>] :ost:ost_blocking_ast+0x0/0xaa0
>  [<ffffffff888f75a0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
>  [<ffffffff889145d8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
>  [<ffffffff88b107b3>] :ost:ost_handle+0x54b3/0x5a70
>  [<ffffffff800d73cd>] free_block+0x126/0x143
>  [<ffffffff88758305>] :lnet:lnet_match_blocked_msg+0x375/0x390
>  [<ffffffff88919f05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
>  [<ffffffff80089d8d>] enqueue_task+0x41/0x56
>  [<ffffffff8891ec1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
>  [<ffffffff88921357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
>  [<ffffffff80063098>] thread_return+0x62/0xfe
>  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
>  [<ffffffff8001c6fa>] __mod_timer+0xb0/0xbe
>  [<ffffffff88924e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
>  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
>  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
>  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>  [<ffffffff88923bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
>  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> 
> On the MDS I see this in dmesg:
> BUG: soft lockup - CPU#1 stuck for 10s! [ll_evictor:11659]
> CPU 1:
> Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs(U) lockd(U) fscache(U) nfs_acl(U)
> ksocklnd(U) ko2iblnd(U) lnet(U) libcfs(U) raid10(U) crc16(U) raid1(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U)
> xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) cpufreq_ondemand(U) acpi_cpufreq(U)
> freq_table(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
> mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U)
> ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) mlx4_core(U) igb(U) i2c_i801(U) i2c_core(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U)
> dm_log(U) dm_mod(U) dm_mem_cache(U) mppVhba(U) usb_storage(U) ahci(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mptbase(U) shpchp(U) aacraid(U) mppUpper(U) sg(U)
> sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> Pid: 11659, comm: ll_evictor Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> RIP: 0010:[<ffffffff80064aee>]  [<ffffffff80064aee>] _write_lock+0x7/0xf
> RSP: 0018:ffff81034fcefc78  EFLAGS: 00000246
> RAX: 000000000000ffff RBX: 000000000000a09f RCX: 00000000000d0117
> RDX: 0000000000000195 RSI: ffffffff802fae80 RDI: ffffc20011f029fc
> RBP: 0000000000000286 R08: ffff81000100e8e0 R09: 0000000000000000
> R10: ffff8106070f5200 R11: 0000000000000150 R12: 0000000000000286
> R13: ffff81034fcefc20 R14: ffff8106070f5258 R15: ffff8106070f5200
> FS:  00002af849256220(0000) GS:ffff81010c4994c0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000386ba41900 CR3: 0000000000201000 CR4: 00000000000006e0
> 
> Call Trace:
>  [<ffffffff88830ba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
>  [<ffffffff88837ae8>] :obdclass:class_disconnect+0x398/0x420
>  [<ffffffff88bc75e1>] :mds:mds_disconnect+0x121/0xe40
>  [<ffffffff8014b87a>] snprintf+0x44/0x4c
>  [<ffffffff88833994>] :obdclass:class_fail_export+0x384/0x4c0
>  [<ffffffff88904238>] :ptlrpc:ping_evictor_main+0x4f8/0x7e0
>  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
>  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
>  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>  [<ffffffff88903d40>] :ptlrpc:ping_evictor_main+0x0/0x7e0
>  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> 
> Can anyone shed some light on this?
> 
> Thanks
> Erik