[Lustre-discuss] 1.8.1.1

Papp Tamás tompos at martos.bme.hu
Thu Nov 19 11:08:20 PST 2009


Derek Yarnell wrote on 2009-11-10 07:19:
> This looks exactly like what we are running into, my bugzilla 
> (https://bugzilla.lustre.org/show_bug.cgi?id=21256) was duped to this 
> bug,
>
>   https://bugzilla.lustre.org/show_bug.cgi?id=19557
>
> But I was not sure, until you reported it, whether you were also seeing 
> your network stack being shut down. We see our NIC fall down when this 
> happens, running RHEL5.3 with 1.8.1.0.
>
> It is unclear from the bugzilla entry what the ETA on a fix is going to be.

Well, my problem has not gone away; in fact it now occurs more frequently.

There were days when the logs were clean enough, and other days when there 
were call traces, but the cluster remained functional.

Until this afternoon, that is. As I wrote earlier, we have one MDS (meta1) 
and two OSSes (node1 and node2). meta1 and node2 are working fine, but 
node1 has a load of 8, and top shows:

 4820 root      15   0     0    0    0 R 99.8  0.0  75:42.89 ll_ost_06
 4894 root      15   0     0    0    0 R 99.8  0.0  75:41.79 ll_ost_80


The logs are full of this:

Nov 19 20:03:32 node1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! 
[ll_ost_80:4894]
Nov 19 20:03:32 node1 kernel: CPU 3:
Nov 19 20:03:32 node1 kernel: Modules linked in: obdfilter(U) 
fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) 
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) 
lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) 
lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) 
dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) 
sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) 
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) 
shpchp(U) igb(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) 
dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) 
libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) 
ohci_hcd(U) ehci_hcd(U)
Nov 19 20:03:32 node1 kernel: Pid: 4894, comm: ll_ost_80 Tainted: G      2.6.18-128.1.6.el5_lustre.1.8.0.1smp #1
Nov 19 20:03:32 node1 kernel: RIP: 0010:[<ffffffff885814f8>]  [<ffffffff885814f8>] :ptlrpc:ptlrpc_queue_wait+0xe58/0x1690
Nov 19 20:03:32 node1 kernel: RSP: 0018:ffff810220143760  EFLAGS: 00000202
Nov 19 20:03:32 node1 kernel: RAX: ffff8101f7b5a400 RBX: ffff81020d360000 RCX: 0000000000000000
Nov 19 20:03:32 node1 kernel: RDX: ffff8100a739a408 RSI: 0000000000000000 RDI: ffff8101f7b5a548
Nov 19 20:03:32 node1 kernel: RBP: ffffffffffffffff R08: ffff81020d360128 R09: 0000000000000000
Nov 19 20:03:32 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81020d36005c
Nov 19 20:03:32 node1 kernel: R13: ffff81020d360000 R14: 0000000000000013 R15: ffffffff88577e6c
Nov 19 20:03:32 node1 kernel: FS:  00002b9c353dc230(0000) GS:ffff810107f61540(0000) knlGS:0000000000000000
Nov 19 20:03:32 node1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 19 20:03:32 node1 kernel: CR2: 0000003862603080 CR3: 0000000232d91000 CR4: 00000000000006e0
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:32 node1 kernel: Call Trace:
Nov 19 20:03:32 node1 kernel:  [<ffffffff8858f185>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 19 20:03:32 node1 kernel:  [<ffffffff88578ca2>] :ptlrpc:ptlrpc_at_set_req_timeout+0x82/0xd0
Nov 19 20:03:32 node1 kernel:  [<ffffffff885794c3>] :ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Nov 19 20:03:32 node1 kernel:  [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel:  [<ffffffff88568d47>] :ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Nov 19 20:03:32 node1 kernel:  [<ffffffff88575af3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 19 20:03:32 node1 kernel:  [<ffffffff8855d850>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Nov 19 20:03:32 node1 kernel:  [<ffffffff8884205c>] :obdfilter:filter_intent_policy+0x66c/0x780
Nov 19 20:03:32 node1 kernel:  [<ffffffff884825b0>] :lnet:LNetPut+0x700/0x800
Nov 19 20:03:32 node1 kernel:  [<ffffffff88549d56>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Nov 19 20:03:32 node1 kernel:  [<ffffffff88565b5f>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Nov 19 20:03:32 node1 kernel:  [<ffffffff884ca748>] :obdclass:lustre_hash_add+0x208/0x2d0
Nov 19 20:03:32 node1 kernel:  [<ffffffff887f6f20>] :ost:ost_blocking_ast+0x0/0xaa0
Nov 19 20:03:32 node1 kernel:  [<ffffffff8856c779>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Nov 19 20:03:32 node1 kernel:  [<ffffffff88589208>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Nov 19 20:03:32 node1 kernel:  [<ffffffff887ff110>] :ost:ost_handle+0x54a0/0x5a60
Nov 19 20:03:32 node1 kernel:  [<ffffffff88482de5>] :lnet:lnet_match_blocked_msg+0x375/0x390
Nov 19 20:03:32 node1 kernel:  [<ffffffff800d7301>] free_block+0x11c/0x13b
Nov 19 20:03:32 node1 kernel:  [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel:  [<ffffffff80148fe6>] __next_cpu+0x19/0x28
Nov 19 20:03:32 node1 kernel:  [<ffffffff80088f4a>] find_busiest_group+0x20d/0x621
Nov 19 20:03:32 node1 kernel:  [<ffffffff8858eb75>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 19 20:03:32 node1 kernel:  [<ffffffff80089d90>] enqueue_task+0x41/0x56
Nov 19 20:03:32 node1 kernel:  [<ffffffff8859382d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Nov 19 20:03:32 node1 kernel:  [<ffffffff88595db3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Nov 19 20:03:32 node1 kernel:  [<ffffffff80063097>] thread_return+0x62/0xfe
Nov 19 20:03:32 node1 kernel:  [<ffffffff8008882b>] __wake_up_common+0x3e/0x68
Nov 19 20:03:32 node1 kernel:  [<ffffffff885992e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 19 20:03:32 node1 kernel:  [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Nov 19 20:03:32 node1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 19 20:03:32 node1 kernel:  [<ffffffff885980d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 19 20:03:32 node1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 19 20:03:32 node1 kernel:
Nov 19 20:03:34 node1 kernel: Lustre: Request x1319785196423799 sent 
from dafs-OST0001 to NID 192.168.3.115@tcp 0s ago has timed out (limit 7s).
Nov 19 20:03:34 node1 kernel: Lustre: Skipped 40339060 previous similar 
messages
0; still busy with 3 active RPCs
Nov 19 20:03:46 node1 kernel: Lustre: 
4924:0:(ldlm_lib.c:815:target_handle_connect()) Skipped 85 previous 
similar messages


So the client nodes cannot use the storage continuously.

Actually, last weekend I downgraded Lustre to 1.8.0.1, so this is not 
with 1.8.1.1, but I'm afraid it doesn't matter.

Rebooting no longer helps at all; the problem comes back within 30 minutes.

Any ideas or suggestions?
I'm getting a bit desperate about why this happens and what I could do about it.

Thank you,

tamas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.log.zip
Type: application/x-zip-compressed
Size: 184829 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091119/19a15494/attachment.bin>

