[Lustre-discuss] Soft CPU Lockup

Hendelman, Rob Rob.Hendelman at magnetar.com
Mon Oct 5 11:40:11 PDT 2009


Hi list!

We have a very simple Lustre setup as follows:

Server1 (MGS/MDS)
1 MGS/MDS with 3 LUNs serving 2 Lustre filesystems:
1 LUN = MGS data
1 LUN = home dirs for users
1 LUN = research data

Server2
(Currently unused)

Server3 (OSS for research data - no errors)

Server4 (OSS for mds1 that contains homedir data)
12 OSTs, approximately 1.1 TB each.

All servers are running CentOS 5.1 with the Lustre 1.6.7.2 RPMs from
Sun.  We also have 5 clients running Ubuntu with a patchless 2.6.22.19
kernel.  Today client1 lost its Lustre mounts (a df -h hangs), but the
other clients were all OK.

On the OSS for the homedir data, I saw the following in /var/log/syslog:

Oct  5 13:07:48 maglustre04 kernel: 
Oct  5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#1 stuck for
10s! [ll_ost_35:13366]
Oct  5 13:07:58 maglustre04 kernel: CPU 1:
Oct  5 13:07:58 maglustre04 kernel: Modules linked in: obdfilter(U)
fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U)
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U)
sunrpc(U) dm_round_robin(U) dm_emc(U) dm_multipath(U) video(U) sbs(U)
backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U)
pata_acpi(U) lpfc(U) ide_cd(U) hpwdt(U) bnx2(U) shpchp(U) cdrom(U)
scsi_transport_fc(U) i5000_edac(U) serio_raw(U) edac_mc(U) pcspkr(U)
dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U)
cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U)
uhci_hcd(U)
Oct  5 13:07:58 maglustre04 kernel: Pid: 13366, comm: ll_ost_35 Tainted:
G      2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Oct  5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8856caed>]
[<ffffffff8856caed>] :ptlrpc:ptlrpc_queue_wait+0x93d/0x1690
Oct  5 13:07:58 maglustre04 kernel: RSP: 0018:ffff8101acd09780  EFLAGS:
00000202
Oct  5 13:07:58 maglustre04 kernel: RAX: ffff81051bf1cc00 RBX:
ffff8103f1b76000 RCX: 0000000000080000
Oct  5 13:07:58 maglustre04 kernel: RDX: ffff81051bf1cca0 RSI:
ffff81023e40fc08 RDI: ffff8103f1b76008
Oct  5 13:07:58 maglustre04 kernel: RBP: ffff8103f1b7605c R08:
00000000ffffffff R09: 0000000000000020
Oct  5 13:07:58 maglustre04 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffff8103f1b76000
Oct  5 13:07:58 maglustre04 kernel: R13: ffff8103f1b76000 R14:
0000000000000013 R15: ffffffff885657ec
Oct  5 13:07:58 maglustre04 kernel: FS:  00002b9509e77220(0000)
GS:ffff81052ff9a640(0000) knlGS:0000000000000000
Oct  5 13:07:58 maglustre04 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Oct  5 13:07:58 maglustre04 kernel: CR2: 0000003184c99a60 CR3:
0000000000201000 CR4: 00000000000006e0
Oct  5 13:07:58 maglustre04 kernel: 
Oct  5 13:07:58 maglustre04 kernel: Call Trace:
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885768b5>]
:ptlrpc:lustre_msg_set_opc+0x45/0x120
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88566e73>]
:ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8008abbc>]
default_wake_function+0x0/0xe
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88554a87>]
:ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88561953>]
:ptlrpc:interval_iterate_reverse+0x73/0x240
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88549700>]
:ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8881818c>]
:obdfilter:filter_intent_policy+0x68c/0x7a0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88536d76>]
:ptlrpc:ldlm_lock_enqueue+0x186/0xb00
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885518ef>]
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff884ba688>]
:obdclass:lustre_hash_add+0x208/0x2d0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8855a490>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885585e9>]
:ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885751b8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff887d630a>]
:ost:ost_handle+0x565a/0x5cd0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff80143b75>]
__next_cpu+0x19/0x28
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff80143b75>]
__next_cpu+0x19/0x28
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800898e6>]
find_busiest_group+0x20d/0x621
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88574795>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8857ceea>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8857ed6d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885812f3>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8006d8a2>]
do_gettimeofday+0x40/0x8f
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff884247c6>]
:libcfs:lcw_update_time+0x16/0x100
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800891f9>]
__wake_up_common+0x3e/0x68
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885847e8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8008abbc>]
default_wake_function+0x0/0xe
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800b4391>]
audit_syscall_exit+0x31b/0x336
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885835d0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Oct  5 13:07:58 maglustre04 kernel: 
Oct  5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#4 stuck for
10s! [ll_ost_90:13421]
Oct  5 13:07:58 maglustre04 kernel: CPU 4:
Oct  5 13:07:58 maglustre04 kernel: Modules linked in: obdfilter(U)
fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U)
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U)
sunrpc(U) dm_round_robin(U) dm_emc(U) dm_multipath(U) video(U) sbs(U)
backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U)
pata_acpi(U) lpfc(U) ide_cd(U) hpwdt(U) bnx2(U) shpchp(U) cdrom(U)
scsi_transport_fc(U) i5000_edac(U) serio_raw(U) edac_mc(U) pcspkr(U)
dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U)
cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U)
uhci_hcd(U)
Oct  5 13:07:58 maglustre04 kernel: Pid: 13421, comm: ll_ost_90 Tainted:
G      2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Oct  5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8005475f>]
[<ffffffff8005475f>] strrchr+0x19/0x24
Oct  5 13:07:58 maglustre04 kernel: RSP: 0018:ffff8101e534f358  EFLAGS:
00000212
Oct  5 13:07:58 maglustre04 kernel: RAX: ffffffff885a6497 RBX:
ffffffff885ae804 RCX: 0000000000000039
Oct  5 13:07:58 maglustre04 kernel: RDX: ffffffff885a6460 RSI:
000000000000002f RDI: ffffffff885a6499
Oct  5 13:07:58 maglustre04 kernel: RBP: 0000010000000100 R08:
ffffffff8859ebe0 R09: 00000000000007b7
Oct  5 13:07:58 maglustre04 kernel: R10: ffffffff885ae831 R11:
ffffffff885ae804 R12: ffffffff00000107
Oct  5 13:07:58 maglustre04 kernel: R13: ffff8101b0531b58 R14:
00000000000000b4 R15: 000000a800000100
Oct  5 13:07:58 maglustre04 kernel: FS:  00002b9509e77220(0000)
GS:ffff81052fe21b40(0000) knlGS:0000000000000000
Oct  5 13:07:58 maglustre04 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Oct  5 13:07:58 maglustre04 kernel: CR2: 0000003184c6bf00 CR3:
0000000000201000 CR4: 00000000000006e0
Oct  5 13:07:58 maglustre04 kernel: 
Oct  5 13:07:58 maglustre04 kernel: Call Trace:
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff884232aa>]
:libcfs:libcfs_debug_vmsg2+0x4a/0x980
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88576005>]
:ptlrpc:_debug_req+0x4b5/0x4d0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8002e1d8>]
__wake_up+0x38/0x4f
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88576005>]
:ptlrpc:_debug_req+0x4b5/0x4d0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88566242>]
:ptlrpc:ptlrpc_expire_one_request+0x1d2/0x530
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885657ec>]
:ptlrpc:ptlrpc_unregister_reply+0x13c/0x9c0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8856929d>]
:ptlrpc:ptlrpc_check_reply+0x18d/0x530
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8856caa0>]
:ptlrpc:ptlrpc_queue_wait+0x8f0/0x1690
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885768b5>]
:ptlrpc:lustre_msg_set_opc+0x45/0x120
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88566e73>]
:ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8008abbc>]
default_wake_function+0x0/0xe
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88554a87>]
:ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88561953>]
:ptlrpc:interval_iterate_reverse+0x73/0x240
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88549700>]
:ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8881818c>]
:obdfilter:filter_intent_policy+0x68c/0x7a0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88536d76>]
:ptlrpc:ldlm_lock_enqueue+0x186/0xb00
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885518ef>]
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff884ba688>]
:obdclass:lustre_hash_add+0x208/0x2d0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8855a490>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885585e9>]
:ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885751b8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff887d630a>]
:ost:ost_handle+0x565a/0x5cd0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff80143b75>]
__next_cpu+0x19/0x28
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800898e6>]
find_busiest_group+0x20d/0x621
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff88574795>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8857ceea>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8857ed6d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885812f3>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8006d8a2>]
do_gettimeofday+0x40/0x8f
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff884247c6>]
:libcfs:lcw_update_time+0x16/0x100
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800891f9>]
__wake_up_common+0x3e/0x68
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885847e8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8008abbc>]
default_wake_function+0x0/0xe
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff800b4391>]
audit_syscall_exit+0x31b/0x336
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff885835d0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  5 13:07:58 maglustre04 kernel:  [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Oct  5 13:07:58 maglustre04 kernel: 

After searching Bugzilla, this looks like it may be bug #19785.  Do you
agree?  The difference is that the "RIP" line in that bug references
text.lock.spinlock, whereas ours references strrchr (for one thread)
and ptlrpc_queue_wait (for the other thread).
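
For comparing traces like the ones above against a bug report, a small
script can pull the stuck thread and the RIP symbol out of each soft
lockup report.  This is only a rough sketch: the function names
(unwrap, summarize_lockups) and the regular expressions are assumptions
based on the log layout shown in this message (syslog timestamps, lines
wrapped by the mail client), not part of any Lustre tool.

```python
import re

# Assumption: every syslog record starts with a timestamp like "Oct  5 13:07:58 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
RIP = re.compile(
    r"RIP: \S+\s+(?:\[<[0-9a-f]+>\]\s+)+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def unwrap(text):
    """Rejoin records that were wrapped across several lines."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line.rstrip())
        else:
            # Continuation of the previous record.
            records[-1] += " " + line.strip()
    return records


def summarize_lockups(text):
    """Return one dict per soft lockup: cpu, seconds, thread, pid, rip."""
    lockups = []
    for record in unwrap(text):
        m = LOCKUP.search(record)
        if m:
            lockups.append({
                "cpu": int(m.group(1)),
                "seconds": int(m.group(2)),
                "thread": m.group(3),
                "pid": int(m.group(4)),
                "rip": None,
            })
            continue
        m = RIP.search(record)
        if m and lockups and lockups[-1]["rip"] is None:
            # Drop the leading ":" that precedes module-qualified symbols.
            lockups[-1]["rip"] = m.group(1).lstrip(":")
    return lockups


# Excerpt of the log from this message, wrapped exactly as posted.
SAMPLE = """
Oct  5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#1 stuck for
10s! [ll_ost_35:13366]
Oct  5 13:07:58 maglustre04 kernel: CPU 1:
Oct  5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8856caed>]
[<ffffffff8856caed>] :ptlrpc:ptlrpc_queue_wait+0x93d/0x1690
Oct  5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#4 stuck for
10s! [ll_ost_90:13421]
Oct  5 13:07:58 maglustre04 kernel: CPU 4:
Oct  5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8005475f>]
[<ffffffff8005475f>] strrchr+0x19/0x24
"""

if __name__ == "__main__":
    for entry in summarize_lockups(SAMPLE):
        print(entry["cpu"], entry["thread"], entry["pid"], entry["rip"])
```

Running the same function over the full /var/log/syslog contents would
give one entry per lockup report, which makes it easier to see whether
the RIP symbols match those in a candidate bug.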

In the meantime, server4 (maglustre04) has two hung threads (each at
100% CPU) which appear to be OST I/O related.  What is the correct way
to resolve this?

Thank you,

Robert



