[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

Fan Yong yong.fan at whamcloud.com
Tue Sep 14 02:55:28 PDT 2010


  On 9/14/10 5:21 PM, Temple Jason wrote:
>
> Hello,
>
> I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4.  
> The first day we brought our system online with the new version, we 
> started seeing clients getting stuck in this soft lockup loop.  The 
> load shoots up over 120, and eventually the node becomes unusable and 
> requires a hard reset.  I've seen loops like this on the server side 
> in previous lustre versions, but to have it happen on the client is 
> completely new.  Here is a bit of what I see in the logs:
>
> Sep 13 21:11:39 wn122 kernel: LustreError: 
> 27016:0:(statahead.c:289:ll_sai_entry_fini()) 
> ASSERTION(sa_is_stopped(sai)) failed
>
The soft lockup is caused by this ASSERTION: when the ASSERTION 
triggers, the failing thread is still holding a spin_lock, which leaves 
the other threads busy-waiting on that spin_lock.

> Sep 13 21:11:39 wn122 kernel: LustreError: 
> 27016:0:(statahead.c:289:ll_sai_entry_fini()) LBUG
>
> Sep 13 21:11:39 wn122 kernel: Pid: 27016, comm: athena.py
>
> Sep 13 21:11:39 wn122 kernel:
>
> Sep 13 21:11:39 wn122 kernel: Call Trace:
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff885cf6a1>] 
> libcfs_debug_dumpstack+0x51/0x60 [libcfs]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff885cfbda>] 
> lbug_with_loc+0x7a/0xd0 [libcfs]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff885d7f00>] 
> tracefile_init+0x0/0x110 [libcfs]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff889247e9>] 
> ll_statahead_exit+0x409/0x500 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8008a4b4>] 
> default_wake_function+0x0/0xe
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff888d2f2e>] 
> ll_intent_drop_lock+0x8e/0xb0 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8891c2bb>] 
> ll_lookup_it+0x30b/0x7c0 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff888e250d>] 
> __ll_inode_revalidate_it+0x5bd/0x650 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff886f34f4>] 
> ldlm_lock_add_to_lru+0x74/0xe0 [ptlrpc]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff88919431>] 
> ll_convert_intent+0xb1/0x170 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8891d4c7>] 
> ll_lookup_nd+0x207/0x400 [lustre]
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8002229b>] d_alloc+0x174/0x1a9
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8000cc40>] do_lookup+0xe5/0x1e6
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff80009fce>] 
> __link_path_walk+0xa01/0xf42
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8000e803>] 
> link_path_walk+0x5c/0xe5
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff800f4fa9>] 
> compat_sys_getdents+0xaf/0xbd
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8000c9df>] 
> do_path_lookup+0x270/0x2e8
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff800123f7>] getname+0x15b/0x1c1
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff80023318>] 
> __user_walk_fd+0x37/0x4c
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff800320b8>] 
> sys_faccessat+0xe4/0x18d
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff800f4fa9>] 
> compat_sys_getdents+0xaf/0xbd
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8006149b>] 
> sysenter_do_call+0x1b/0x67
>
> Sep 13 21:11:39 wn122 kernel:  [<ffffffff8002c205>] 
> dummy_inode_permission+0x0/0x3
>
> Sep 13 21:11:39 wn122 kernel:
>
> Sep 13 21:11:39 wn122 kernel: LustreError: dumping log to 
> /tmp/lustre-log.1284405099.27016
>
> Sep 13 21:11:44 wn122 dhclient: DHCPREQUEST on eth0 to 148.187.67.113 
> port 67
>
> Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#3 stuck for 10s! 
> [ptlrpcd:31817]
>
> Sep 13 21:11:49 wn122 kernel: CPU 3:
>
> Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) 
> lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
> lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U) 
> ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter 
> iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables 
> x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) 
> ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api 
> ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) 
> ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs 
> i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp 
> parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr 
> dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero 
> dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd 
> uhci_hcd ohci_hcd ehci_hcd
>
> Sep 13 21:11:49 wn122 kernel: Pid: 31817, comm: ptlrpcd Tainted: 
> G      2.6.18-128.7.1.el5 #1
>
> Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb7>]  
> [<ffffffff80064cb7>] .text.lock.spinlock+0x5/0x30
>
> Sep 13 21:11:49 wn122 kernel: RSP: 0018:ffff8101ec177cb8  EFLAGS: 00000282
>
> Sep 13 21:11:49 wn122 kernel: RAX: 000000000000004f RBX: 
> 0000000000000000 RCX: 0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: RDX: ffff81035956b480 RSI: 
> ffff810253c2d400 RDI: ffff810552ccb500
>
> Sep 13 21:11:49 wn122 kernel: RBP: ffff810192294000 R08: 
> 5a5a5a5a5a5a5a5a R09: 5a5a5a5a5a5a5a5a
>
> Sep 13 21:11:49 wn122 kernel: R10: 5a5a5a5a5a5a5a5a R11: 
> 5a5a5a5a5a5a5a5a R12: 0000000000000038
>
> Sep 13 21:11:49 wn122 kernel: R13: ffff81045b0150c0 R14: 
> ffff81067fc57000 R15: ffffffff886f5168
>
> Sep 13 21:11:49 wn122 kernel: FS:  00002b5af649d240(0000) 
> GS:ffff81010c4c8e40(0000) knlGS:0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 000000008005003b
>
> Sep 13 21:11:49 wn122 kernel: CR2: 0000000008183094 CR3: 
> 0000000000201000 CR4: 00000000000006e0
>
> Sep 13 21:11:49 wn122 kernel:
>
> Sep 13 21:11:49 wn122 kernel: Call Trace:
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff889249dc>] 
> :lustre:ll_statahead_interpret+0xfc/0x5b0
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88861779>] 
> :mdc:mdc_intent_getattr_async_interpret+0x459/0x490
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88729246>] 
> :ptlrpc:ptlrpc_check_set+0x1186/0x1440
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8004ac1b>] 
> try_to_del_timer_sync+0x51/0x5a
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8875d05d>] 
> :ptlrpc:ptlrpcd_check+0xdd/0x1f0
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff80095004>] 
> process_timeout+0x0/0x5
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8875d9a1>] 
> :ptlrpc:ptlrpcd+0x1b1/0x259
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8008a4b4>] 
> default_wake_function+0x0/0xe
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8875d7f0>] 
> :ptlrpc:ptlrpcd+0x0/0x259
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Sep 13 21:11:49 wn122 kernel:
>
> Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#0 stuck for 10s! 
> [ll_sa_27016:28593]
>
> Sep 13 21:11:49 wn122 kernel: CPU 0:
>
> Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) 
> lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
> lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U) 
> ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter 
> iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables 
> x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) 
> ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api 
> ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) 
> ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs 
> i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp 
> parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr 
> dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero 
> dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd 
> uhci_hcd ohci_hcd ehci_hcd
>
> Sep 13 21:11:49 wn122 kernel: Pid: 28593, comm: ll_sa_27016 Tainted: 
> G      2.6.18-128.7.1.el5 #1
>
> Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb4>]  
> [<ffffffff80064cb4>] .text.lock.spinlock+0x2/0x30
>
> Sep 13 21:11:49 wn122 kernel: RSP: 0000:ffff810251edfcf8  EFLAGS: 00000282
>
> Sep 13 21:11:49 wn122 kernel: RAX: 0000000000000001 RBX: 
> ffff8101cf193e80 RCX: 0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: RDX: 0000000000000012 RSI: 
> 000000000abe79b0 RDI: ffff810552ccb500
>
> Sep 13 21:11:49 wn122 kernel: RBP: 0000000000000286 R08: 
> 0000000351edfde0 R09: 0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: R10: ffff810311d9a400 R11: 
> 0000000000000248 R12: ffff81023f907a00
>
> Sep 13 21:11:49 wn122 kernel: R13: ffffffff88726dba R14: 
> ffff810162543bc0 R15: 0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: FS:  0000000000000000(0000) 
> GS:ffffffff803ac000(0000) knlGS:0000000000000000
>
> Sep 13 21:11:49 wn122 kernel: CS:  0010 DS: 002b ES: 002b CR0: 
> 000000008005003b
>
> Sep 13 21:11:49 wn122 kernel: CR2: 000000000c0f900c CR3: 
> 000000066b863000 CR4: 00000000000006e0
>
> Sep 13 21:11:49 wn122 kernel:
>
> Sep 13 21:11:49 wn122 kernel: Call Trace:
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88923787>] 
> :lustre:ll_sai_entry_to_stated+0x87/0x330
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8000d0de>] dput+0x2c/0x114
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88923f8b>] 
> :lustre:do_statahead_interpret+0x55b/0x5c0
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88925641>] 
> :lustre:ll_statahead_thread+0x7b1/0x1750
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8008a4b4>] 
> default_wake_function+0x0/0xe
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff88924e90>] 
> :lustre:ll_statahead_thread+0x0/0x1750
>
> Sep 13 21:11:49 wn122 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> After we had 3 clients go into this state the first day, and 2 the 
> next, we decided to downgrade the clients back to 1.8.3, while leaving 
> the servers at 1.8.4.  This morning, we had 2 more clients get stuck 
> again, so my assumption is that there is a problem with the 1.8.4 servers.
>
This ASSERTION looks related to statahead. That feature only affects 
client-side behavior and does nothing on the server. On the other hand, 
we never saw this on lustre-1.8.3 or earlier versions, and as far as I 
know there were almost no statahead changes between lustre-1.8.3 and 
lustre-1.8.4. So I am not sure whether your assumption is right.

Anyway, to keep your system running, you can first disable statahead on 
the client side and check whether the problem goes away after that. The 
client-side proc tunable "statahead_max" controls this.
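
A sketch of that workaround, assuming the usual 1.8 client layout where the tunable sits under /proc/fs/lustre/llite/ (run as root on each affected client):

```shell
# Disable statahead on a Lustre client: setting statahead_max to 0
# turns the feature off for each mounted filesystem instance.
for f in /proc/fs/lustre/llite/*/statahead_max; do
    echo 0 > "$f"
done

# Verify the new setting (should print 0 for each mount):
cat /proc/fs/lustre/llite/*/statahead_max
```

If lctl supports set_param on your client version, `lctl set_param llite.*.statahead_max=0` should have the same effect; the change does not persist across remounts either way.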

Cheers,
Nasf
>
> These are RedHat 5.4 servers and clients, with the 
> 2.6.18-194.3.1.el5_lustre.1.8.4 kernel.
>
> Any help on this issue would be great.
>
> Thanks,
>
> Jason Temple
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

