[Lustre-discuss] 1.8.1.1
Papp Tamás
tompos at martos.bme.hu
Thu Nov 19 11:08:20 PST 2009
Derek Yarnell wrote, On 2009. 11. 10. 7:19:
> This looks exactly like what we are running into, my bugzilla
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this
> bug,
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I am not sure until you reported it that you are also seeing that
> your network stack is being shut down? We see our NIC fall down when
> this happens, running RHEL5.3 1.8.1.0.
>
> Unclear what the ETA on a fix is going to be from the bugzilla entry.
Well, my problem has not gone away; in fact, it occurs more frequently.
There were days when the logs were clean enough, and other days when
there were call traces, but the cluster stayed functional.
Until this afternoon. As I wrote earlier, we have one MDS (meta1) and two
OSSes (node1 and node2). The servers meta1 and node2 are working fine,
but node1 has a load of 8, and top says:
4820 root 15 0 0 0 0 R 99.8 0.0 75:42.89 ll_ost_06
4894 root 15 0 0 0 0 R 99.8 0.0 75:41.79 ll_ost_80
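For anyone wanting to pick the spinning service threads out of a longer
thread list, a small awk filter over batch-mode top output works; this is
just a sketch (the sample lines below are inlined copies of the top output
above — on the server you would pipe `top -b -n1` into the same awk):

```shell
# Print PID and command name for threads above 90% CPU.
# Field 9 is %CPU and field 12 is the command in this top layout.
awk '$9 > 90 {print $1, $12}' <<'EOF'
4820 root 15 0 0 0 0 R 99.8 0.0 75:42.89 ll_ost_06
4894 root 15 0 0 0 0 R 99.8 0.0 75:41.79 ll_ost_80
4900 root 15 0 0 0 0 S 0.3 0.0 0:01.00 kjournald
EOF
```

This prints only the two pegged ll_ost threads and skips everything idle.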
The logs are full of this:
Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s!
[ll_ost_80:4894]
Nov 19 20:03:32 node1 kernel: CPU 3:
Nov 19 20:03:32 node1 kernel: Modules linked in: obdfilter(U)
fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U)
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U)
ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U)
lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U)
dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U)
sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U)
shpchp(U) igb(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U)
dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U)
libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U)
ohci_hcd(U) ehci_hcd(U)
Nov 19 20:03:32 node1 kernel: Pid: 4894, comm: ll_ost_80 Tainted: G
2.6.18-128.1.6.el5_lustre.1.8.0.1smp #1
Nov 19 20:03:32 node1 kernel: RIP: 0010:[<ffffffff885814f8>]
[<ffffffff885814f8>] :ptlrpc:ptlrpc_queue_wait+0xe58/0x1690
Nov 19 20:03:32 node1 kernel: RSP: 0018:ffff810220143760 EFLAGS: 00000202
Nov 19 20:03:32 node1 kernel: RAX: ffff8101f7b5a400 RBX:
ffff81020d360000 RCX: 0000000000000000
Nov 19 20:03:32 node1 kernel: RDX: ffff8100a739a408 RSI:
0000000000000000 RDI: ffff8101f7b5a548
Nov 19 20:03:32 node1 kernel: RBP: ffffffffffffffff R08:
ffff81020d360128 R09: 0000000000000000
Nov 19 20:03:32 node1 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffff81020d36005c
Nov 19 20:03:32 node1 kernel: R13: ffff81020d360000 R14:
0000000000000013 R15: ffffffff88577e6c
Nov 19 20:03:32 node1 kernel: FS: 00002b9c353dc230(0000)
GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 19 20:03:32 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Nov 19 20:03:32 node1 kernel: CR2: 0000003862603080 CR3:
0000000232d91000 CR4: 00000000000006e0
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:32 node1 kernel: Call Trace:
Nov 19 20:03:32 node1 kernel: [<ffffffff8858f185>]
:ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 19 20:03:32 node1 kernel: [<ffffffff88578ca2>]
:ptlrpc:ptlrpc_at_set_req_timeout+0x82/0xd0
Nov 19 20:03:32 node1 kernel: [<ffffffff885794c3>]
:ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>]
default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff88568d47>]
:ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Nov 19 20:03:32 node1 kernel: [<ffffffff88575af3>]
:ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 19 20:03:32 node1 kernel: [<ffffffff8855d850>]
:ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Nov 19 20:03:32 node1 kernel: [<ffffffff8884205c>]
:obdfilter:filter_intent_policy+0x66c/0x780
Nov 19 20:03:32 node1 kernel: [<ffffffff884825b0>]
:lnet:LNetPut+0x700/0x800
Nov 19 20:03:32 node1 kernel: [<ffffffff88549d56>]
:ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Nov 19 20:03:32 node1 kernel: [<ffffffff88565b5f>]
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Nov 19 20:03:32 node1 kernel: [<ffffffff884ca748>]
:obdclass:lustre_hash_add+0x208/0x2d0
Nov 19 20:03:32 node1 kernel: [<ffffffff887f6f20>]
:ost:ost_blocking_ast+0x0/0xaa0
Nov 19 20:03:32 node1 kernel: [<ffffffff8856c779>]
:ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Nov 19 20:03:32 node1 kernel: [<ffffffff88589208>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 19 20:03:32 node1 kernel: [<ffffffff887ff110>]
:ost:ost_handle+0x54a0/0x5a60
Nov 19 20:03:32 node1 kernel: [<ffffffff88482de5>]
:lnet:lnet_match_blocked_msg+0x375/0x390
Nov 19 20:03:32 node1 kernel: [<ffffffff800d7301>] free_block+0x11c/0x13b
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel: [<ffffffff80088f4a>]
find_busiest_group+0x20d/0x621
Nov 19 20:03:32 node1 kernel: [<ffffffff8858eb75>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 19 20:03:32 node1 kernel: [<ffffffff80089d90>] enqueue_task+0x41/0x56
Nov 19 20:03:32 node1 kernel: [<ffffffff8859382d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 19 20:03:32 node1 kernel: [<ffffffff88595db3>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Nov 19 20:03:32 node1 kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Nov 19 20:03:32 node1 kernel: [<ffffffff8008882b>]
__wake_up_common+0x3e/0x68
Nov 19 20:03:32 node1 kernel: [<ffffffff885992e8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8008a3f6>]
default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 19 20:03:32 node1 kernel: [<ffffffff885980d0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 19 20:03:32 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:34 node1 kernel: Lustre: Request x1319785196423799 sent
from dafs-OST0001 to NID 192.168.3.115@tcp 0s ago has timed out (limit 7s).
Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar
messages
0; still busy with 3 active RPCs
Nov 19 20:03:46 node1 kernel: Lustre:
4924:0:(ldlm_lib.c:815:target_handle_connect()) Skipped 85 previous
similar messages
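To get a feel for how bursty those timeouts are (40 million skipped
messages suggests a storm), one can count the "has timed out" lines per
minute. A minimal sketch, with two of the log lines above inlined as
sample input — on the OSS you would feed /var/log/messages into the same
pipe instead of the here-document:

```shell
# Count Lustre RPC timeout messages per minute.
# $1 $2 are month/day; substr($3,1,5) truncates HH:MM:SS to HH:MM.
awk '/has timed out/ {print $1, $2, substr($3,1,5)}' <<'EOF' | sort | uniq -c
Nov 19 20:03:34 node1 kernel: Lustre: Request x1319785196423799 sent from dafs-OST0001 to NID 192.168.3.115@tcp 0s ago has timed out (limit 7s).
Nov 19 20:03:35 node1 kernel: Lustre: Request x1319785196423801 sent from dafs-OST0001 to NID 192.168.3.115@tcp 0s ago has timed out (limit 7s).
Nov 19 20:04:02 node1 kernel: Lustre: Skipped 40339060 previous similar messages
EOF
```

This prints a count next to each minute, so spikes line up with the soft
lockup timestamps; the "Skipped ... similar messages" lines are not
matched, so the real counts are a lower bound.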
So the client nodes are not able to use the storage continuously.
Actually, last weekend I downgraded the Lustre version to 1.8.0.1, so this
is not with 1.8.1.1, but I'm afraid it doesn't matter.
Rebooting no longer helps at all; the problem comes back within 30 minutes.
Any ideas or suggestions?
I'm getting a bit desperate: why does this happen, and what can I do about it?
Thank you,
tamas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.log.zip
Type: application/x-zip-compressed
Size: 184829 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091119/19a15494/attachment.bin>