[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

Fan Yong yong.fan at whamcloud.com
Fri Sep 17 08:47:40 PDT 2010


  On 9/14/10 8:55 PM, Gabriele Paciucci wrote:
> I have the same problem, I put the statahead_max to 0 !!!
In fact, I have made a patch for statahead related issues (including 
this one) against lustre-1.8, which is in inspection.
http://review.whamcloud.com/#change,2
If possible, you can try such patch.

Cheers,
--
Nasf
> On 09/14/2010 11:55 AM, Fan Yong wrote:
>> On 9/14/10 5:21 PM, Temple Jason wrote:
>>> Hello,
>>>
>>> I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4.
>>> The first day we brought our system online with the new version, we
>>> started seeing clients getting stuck in this soft lockup loop. The
>>> load shoots up over 120, and eventually the node becomes unusable and
>>> requires a hard reset. I’ve seen loops like this on the server side
>>> in previous lustre versions, but to have it happen on the client is
>>> completely new. Here is a bit of what I see in the logs:
>>>
>>> Sep 13 21:11:39 wn122 kernel: LustreError:
>>> 27016:0:(statahead.c:289:ll_sai_entry_fini())
>>> ASSERTION(sa_is_stopped(sai)) failed
>>>
>> The soft lockup is caused by this ASSERTION, because when such
>> ASSERTION triggered, the thread holds a spin_lock, then caused other
>> threads busy waiting for such spin_lock.
>>
>>> Sep 13 21:11:39 wn122 kernel: LustreError:
>>> 27016:0:(statahead.c:289:ll_sai_entry_fini()) LBUG
>>>
>>> Sep 13 21:11:39 wn122 kernel: Pid: 27016, comm: athena.py
>>>
>>> Sep 13 21:11:39 wn122 kernel:
>>>
>>> Sep 13 21:11:39 wn122 kernel: Call Trace:
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff885cf6a1>]
>>> libcfs_debug_dumpstack+0x51/0x60 [libcfs]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff885cfbda>]
>>> lbug_with_loc+0x7a/0xd0 [libcfs]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff885d7f00>]
>>> tracefile_init+0x0/0x110 [libcfs]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff889247e9>]
>>> ll_statahead_exit+0x409/0x500 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8008a4b4>]
>>> default_wake_function+0x0/0xe
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff888d2f2e>]
>>> ll_intent_drop_lock+0x8e/0xb0 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8891c2bb>]
>>> ll_lookup_it+0x30b/0x7c0 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff888e250d>]
>>> __ll_inode_revalidate_it+0x5bd/0x650 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff886f34f4>]
>>> ldlm_lock_add_to_lru+0x74/0xe0 [ptlrpc]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff88919431>]
>>> ll_convert_intent+0xb1/0x170 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8891d4c7>]
>>> ll_lookup_nd+0x207/0x400 [lustre]
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8002229b>] d_alloc+0x174/0x1a9
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8000cc40>] do_lookup+0xe5/0x1e6
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff80009fce>]
>>> __link_path_walk+0xa01/0xf42
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8000e803>]
>>> link_path_walk+0x5c/0xe5
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff800f4fa9>]
>>> compat_sys_getdents+0xaf/0xbd
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8000c9df>]
>>> do_path_lookup+0x270/0x2e8
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff800123f7>] getname+0x15b/0x1c1
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff80023318>]
>>> __user_walk_fd+0x37/0x4c
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff800320b8>]
>>> sys_faccessat+0xe4/0x18d
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff800f4fa9>]
>>> compat_sys_getdents+0xaf/0xbd
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8006149b>]
>>> sysenter_do_call+0x1b/0x67
>>>
>>> Sep 13 21:11:39 wn122 kernel: [<ffffffff8002c205>]
>>> dummy_inode_permission+0x0/0x3
>>>
>>> Sep 13 21:11:39 wn122 kernel:
>>>
>>> Sep 13 21:11:39 wn122 kernel: LustreError: dumping log to
>>> /tmp/lustre-log.1284405099.27016
>>>
>>> Sep 13 21:11:44 wn122 dhclient: DHCPREQUEST on eth0 to 148.187.67.113
>>> port 67
>>>
>>> Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
>>> [ptlrpcd:31817]
>>>
>>> Sep 13 21:11:49 wn122 kernel: CPU 3:
>>>
>>> Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U)
>>> lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U)
>>> lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl loc
>>>
>>> kd sunrpc bonding(U) ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp
>>> xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink
>>> iptable_mangle ip_tables x_tables rdma_ucm(U) ib
>>>
>>> _sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U)
>>> ib_cm(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U)
>>> mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) ib_mad(U
>>>
>>> ) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec
>>> button battery asus_acpi acpi_memhotplug ac parport_pc lp parport
>>> joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_
>>>
>>> core(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache
>>> dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod
>>> scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
>>>
>>> Sep 13 21:11:49 wn122 kernel: Pid: 31817, comm: ptlrpcd Tainted: G
>>> 2.6.18-128.7.1.el5 #1
>>>
>>> Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb7>]
>>> [<ffffffff80064cb7>] .text.lock.spinlock+0x5/0x30
>>>
>>> Sep 13 21:11:49 wn122 kernel: RSP: 0018:ffff8101ec177cb8 EFLAGS: 00000282
>>>
>>> Sep 13 21:11:49 wn122 kernel: RAX: 000000000000004f RBX:
>>> 0000000000000000 RCX: 0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: RDX: ffff81035956b480 RSI:
>>> ffff810253c2d400 RDI: ffff810552ccb500
>>>
>>> Sep 13 21:11:49 wn122 kernel: RBP: ffff810192294000 R08:
>>> 5a5a5a5a5a5a5a5a R09: 5a5a5a5a5a5a5a5a
>>>
>>> Sep 13 21:11:49 wn122 kernel: R10: 5a5a5a5a5a5a5a5a R11:
>>> 5a5a5a5a5a5a5a5a R12: 0000000000000038
>>>
>>> Sep 13 21:11:49 wn122 kernel: R13: ffff81045b0150c0 R14:
>>> ffff81067fc57000 R15: ffffffff886f5168
>>>
>>> Sep 13 21:11:49 wn122 kernel: FS: 00002b5af649d240(0000)
>>> GS:ffff81010c4c8e40(0000) knlGS:0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>>
>>> Sep 13 21:11:49 wn122 kernel: CR2: 0000000008183094 CR3:
>>> 0000000000201000 CR4: 00000000000006e0
>>>
>>> Sep 13 21:11:49 wn122 kernel:
>>>
>>> Sep 13 21:11:49 wn122 kernel: Call Trace:
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff889249dc>]
>>> :lustre:ll_statahead_interpret+0xfc/0x5b0
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88861779>]
>>> :mdc:mdc_intent_getattr_async_interpret+0x459/0x490
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88729246>]
>>> :ptlrpc:ptlrpc_check_set+0x1186/0x1440
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8004ac1b>]
>>> try_to_del_timer_sync+0x51/0x5a
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d05d>]
>>> :ptlrpc:ptlrpcd_check+0xdd/0x1f0
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff80095004>]
>>> process_timeout+0x0/0x5
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d9a1>]
>>> :ptlrpc:ptlrpcd+0x1b1/0x259
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8008a4b4>]
>>> default_wake_function+0x0/0xe
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d7f0>]
>>> :ptlrpc:ptlrpcd+0x0/0x259
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>>>
>>> Sep 13 21:11:49 wn122 kernel:
>>>
>>> Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#0 stuck for 10s!
>>> [ll_sa_27016:28593]
>>>
>>> Sep 13 21:11:49 wn122 kernel: CPU 0:
>>>
>>> Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U)
>>> lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U)
>>> lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U)
>>> ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter
>>> iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables
>>> x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)
>>> ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api
>>> ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U)
>>> ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs
>>> i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp
>>> parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr
>>> dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero
>>> dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd
>>> uhci_hcd ohci_hcd ehci_hcd
>>>
>>> Sep 13 21:11:49 wn122 kernel: Pid: 28593, comm: ll_sa_27016 Tainted:
>>> G 2.6.18-128.7.1.el5 #1
>>>
>>> Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb4>]
>>> [<ffffffff80064cb4>] .text.lock.spinlock+0x2/0x30
>>>
>>> Sep 13 21:11:49 wn122 kernel: RSP: 0000:ffff810251edfcf8 EFLAGS: 00000282
>>>
>>> Sep 13 21:11:49 wn122 kernel: RAX: 0000000000000001 RBX:
>>> ffff8101cf193e80 RCX: 0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: RDX: 0000000000000012 RSI:
>>> 000000000abe79b0 RDI: ffff810552ccb500
>>>
>>> Sep 13 21:11:49 wn122 kernel: RBP: 0000000000000286 R08:
>>> 0000000351edfde0 R09: 0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: R10: ffff810311d9a400 R11:
>>> 0000000000000248 R12: ffff81023f907a00
>>>
>>> Sep 13 21:11:49 wn122 kernel: R13: ffffffff88726dba R14:
>>> ffff810162543bc0 R15: 0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: FS: 0000000000000000(0000)
>>> GS:ffffffff803ac000(0000) knlGS:0000000000000000
>>>
>>> Sep 13 21:11:49 wn122 kernel: CS: 0010 DS: 002b ES: 002b CR0:
>>> 000000008005003b
>>>
>>> Sep 13 21:11:49 wn122 kernel: CR2: 000000000c0f900c CR3:
>>> 000000066b863000 CR4: 00000000000006e0
>>>
>>> Sep 13 21:11:49 wn122 kernel:
>>>
>>> Sep 13 21:11:49 wn122 kernel: Call Trace:
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88923787>]
>>> :lustre:ll_sai_entry_to_stated+0x87/0x330
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8000d0de>] dput+0x2c/0x114
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88923f8b>]
>>> :lustre:do_statahead_interpret+0x55b/0x5c0
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88925641>]
>>> :lustre:ll_statahead_thread+0x7b1/0x1750
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8008a4b4>]
>>> default_wake_function+0x0/0xe
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff88924e90>]
>>> :lustre:ll_statahead_thread+0x0/0x1750
>>>
>>> Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>>>
>>> After we had 3 clients go into this state the first day, and 2 the
>>> next, we decided to downgrade the clients back to 1.8.3, while
>>> leaving the servers at 1.8.4. This morning, we had 2 more clients get
>>> stuck again, so my assumption is that there is a problem with the
>>> 1.8.4 servers.
>>>
>> This ASSERTION looks related with statahead, such feature only affects
>> client-side behavior, but nothing for server. On the other hand, we
>> never saw that on lustre-1.8.3 or former version, and as I known,
>> there were almost no changes for statahead between lustre-1.8.3 and
>> lustre-1.8.4. So I am not sure whether your assumption is right not.
>>
>> Anyway, to make your system run, you can disable statahead on
>> client-side firstly, to check what will happen after that. The
>> client-side proc interface "statahead_max" is for that.
>>
>> Cheers,
>> Nasf
>>> These are RedHat 5.4 servers and clients, with the
>>> 2.6.18-194.3.1.el5_lustre.1.8.4 kernel.
>>>
>>> Any help on this issue would be great.
>>>
>>> Thanks,
>>>
>>> Jason Temple
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>




More information about the lustre-discuss mailing list