[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.
Christopher J.Walker
C.J.Walker at qmul.ac.uk
Fri Oct 9 03:25:50 PDT 2009
Lundgren, Andrew wrote:
> I haven’t done anything yet. The machines seem to reconnect without
> intervention. But the problem occurs again later.
>
>
>
> *From:* Đào Thị Thảo [mailto:thaodt at isds.vn]
> *Sent:* Tuesday, October 06, 2009 9:43 PM
> *To:* Lundgren, Andrew
> *Cc:* lustre-discuss
> *Subject:* Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for
> several minutes and then recovering.
>
>
>
> hi,
> I have the same problem as Lundgren.
> I don't understand why it happens; my network is stable. The problem
> recurs, and sometimes the client's connection to the service (OSTs or
> MDS) is never restored. My provisional workaround is to reboot the node.
> Andrew, can you explain in more detail and suggest how to fix it?
>
> On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew
> <Andrew.Lundgren at level3.com> wrote:
>
> Oh man, that should have read LOSING!
>
>
>
> *From:* lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] *On Behalf Of*
> Lundgren, Andrew
> *Sent:* Tuesday, October 06, 2009 11:14 AM
> *To:* lustre-discuss
> *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for
> several minutes and then recovering.
>
>
>
> We have a few 1.8.0 clusters running. We have seen multiple instances
> now where the clients lose connectivity to the MDT. The MDS logs
> indicate that there is some sort of problem on the MDT.
>
>
>
> The following is a typical output:
>
>
>
> Oct 6 02:56:08 mint1502 kernel: LustreError:
> 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096),
> evicting 7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID
>
> Oct 6 02:56:08 mint1502 kernel: req at ffff8100ac9f4000
> x1314647461000449/t0
> o37->7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID:0/0
> lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
>
> Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent
> from content-OST001d-osc to NID 207.123.49.16 at tcp 7s ago has timed out
> (limit 7s).
>
> Oct 6 02:56:16 mint1502 kernel: LustreError: 166-1:
> content-OST001d-osc: Connection to service content-OST001d via nid
> 207.123.49.16 at tcp was lost; in progress operations using this service
> will fail.
>
> Oct 6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct 6 02:56:17 mint1502 kernel: LustreError: 166-1:
> content-OST001c-osc: Connection to service content-OST001c via nid
> 207.123.49.16 at tcp was lost; in progress operations using this service
> will fail.
>
> Oct 6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A
> client on nid 207.123.49.16 at tcp was evicted due to a lock blocking
> callback to 207.123.49.16 at tcp timed out: rc -107
>
> Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A
> client on nid 207.123.49.9 at tcp was evicted due to a lock blocking
> callback to 207.123.49.9 at tcp timed out: rc -107
>
> Oct 6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_mdt_rdpg_04:28999]
>
> Oct 6 02:56:18 mint1502 kernel: CPU 2:
>
> Oct 6 02:56:18 mint1502 kernel: Modules linked in: mds(U)
> fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U)
> mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
> lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U)
> bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U)
> xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U)
> x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U)
> backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U)
> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U)
> i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U)
> pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U)
> dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U)
> scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U)
> mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
>
> Oct 6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04
> Tainted: G 2.6.18-92.1.17.el5_lustre.1.8.0smp #1
>
> Oct 6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>]
> [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
>
> Oct 6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850 EFLAGS:
> 00000206
>
> Oct 6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX:
> 000000000000328c RCX: 000000000000ba75
>
> Oct 6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI:
> ffffffff802f0d80 RDI: ffffc200109d78cc
>
> Oct 6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08:
> ffff810001016e60 R09: 0000000000000000
>
> Oct 6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11:
> 0000000000000150 R12: ffff810448dfd200
>
> Oct 6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14:
> 0000000000000286 R15: ffff8104c02ad7f0
>
> Oct 6 02:56:18 mint1502 kernel: FS: 00002b49ff256220(0000)
> GS:ffff81051fc53d40(0000) knlGS:0000000000000000
>
> Oct 6 02:56:18 mint1502 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
>
> Oct 6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3:
> 0000000000201000 CR4: 00000000000006e0
>
> Oct 6 02:56:18 mint1502 kernel:
>
> Oct 6 02:56:18 mint1502 kernel: Call Trace:
>
> Oct 6 02:56:18 mint1502 kernel: [<ffffffff885962b8>]
> :obdclass:class_disconnect+0x378/0x400
>
> Oct 6 02:56:18 mint1502 kernel: [<ffffffff8860e590>]
> :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
>
> Oct 6 02:56:18 mint1502 kernel: [<ffffffff8891b301>]
> :mds:mds_disconnect+0x121/0xe30
>
> Oct 6 02:56:18 mint1502 kernel: [<ffffffff8003d514>]
> lock_timer_base+0x1b/0x3c
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8004a824>]
> try_to_del_timer_sync+0x51/0x5a
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff885920e4>]
> :obdclass:class_fail_export+0x384/0x4c0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff889150d1>]
> :mds:mds_readpage+0x1571/0x18a0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff886538d5>]
> :ptlrpc:lustre_msg_set_limit+0x35/0xf0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>]
> default_wake_function+0x0/0xe
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff88918470>]
> :mds:mds_handle+0x2140/0x4c80
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff886539c5>]
> :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865867d>]
> :ptlrpc:ptlrpc_check_req+0x1d/0x110
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865ac03>]
> :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff80062f4b>]
> thread_return+0x0/0xdf
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff800891f6>]
> __wake_up_common+0x3e/0x68
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865e138>]
> :ptlrpc:ptlrpc_main+0x1218/0x13e0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>]
> default_wake_function+0x0/0xe
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff800b4382>]
> audit_syscall_exit+0x31b/0x336
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865cf20>]
> :ptlrpc:ptlrpc_main+0x0/0x13e0
>
> Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Oct 6 02:56:19 mint1502 kernel:
>
>
>
> Those go on for a while until it recovers:
>
>
>
> Oct 6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct 6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent
> from content-OST000a-osc to NID 207.123.49.37 at tcp 26s ago has timed out
> (limit 26s).
>
> Oct 6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar
> messages
>
> Oct 6 03:03:12 mint1502 kernel: Lustre:
> 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc:
> tried all connections, increasing latency to 26s
>
> Oct 6 03:03:12 mint1502 kernel: Lustre:
> 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous
> similar messages
>
> Oct 6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client
> e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37 at tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
> Oct 6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages
>
> Oct 6 03:04:02 mint1502 kernel: Lustre:
> 4712:0:(import.c:837:ptlrpc_connect_interpret())
> content-OST000a_UUID at 207.123.49.37 at tcp changed server handle
> from 0x25a7c6432e7fffec to 0x25a7c643369c7038
>
> Oct 6 03:04:02 mint1502 kernel: but is still in recovery
>
> Oct 6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was
> evicted by content-OST000a; in progress operations using this service
> will fail.
>
> Oct 6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct 6 03:04:02 mint1502 kernel: Lustre:
> 4712:0:(import.c:837:ptlrpc_connect_interpret())
> content-OST000b_UUID at 207.123.49.37 at tcp changed server handle
> from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
>
> Oct 6 03:04:02 mint1502 kernel: but is still in recovery
>
> Oct 6 03:04:02 mint1502 kernel: Lustre:
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are
> active, abort quota recovery
>
> Oct 6 03:04:02 mint1502 kernel: Lustre:
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous
> similar message
>
> Oct 6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection
> restored to service content-OST000a using nid 207.123.49.37 at tcp.
>
> Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
>
> Oct 6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000:
> content-OST000a_UUID now active, resetting orphans
>
> Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
>
>
I've just seen it with a 1.8.1 server (and 1.6 clients):
Oct 4 18:21:30 sn01 kernel: CPU 5:
Oct 4 18:21:30 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U)
osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)
autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U)
xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U)
scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U)
i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) netxen_nic(U) ide_cd(U)
serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U)
ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U)
mptbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:30 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>]
[<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct 4 18:21:30 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202
Oct 4 18:21:30 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000002662
RCX: 00000000000083dc
Oct 4 18:21:30 sn01 kernel: RDX: 00000000000002dc RSI: ffffffff802f9e80
RDI: ffffc2001225361c
Oct 4 18:21:30 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0
R09: 0000000000000000
Oct 4 18:21:30 sn01 kernel: R10: ffff8100aa9d5cc0 R11: 0000000000000150
R12: ffff8100acf36800
Oct 4 18:21:30 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286
R15: ffff810123af79d0
Oct 4 18:21:30 sn01 kernel: FS: 00002ab6a6887230(0000)
GS:ffff810106751740(0000) knlGS:0000000000000000
Oct 4 18:21:30 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Oct 4 18:21:30 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000
CR4: 00000000000006e0
Oct 4 18:21:30 sn01 kernel:
Oct 4 18:21:30 sn01 kernel: Call Trace:
Oct 4 18:21:30 sn01 kernel: [<ffffffff8850ba88>]
:obdclass:class_disconnect+0x398/0x420
Oct 4 18:21:30 sn01 kernel: [<ffffffff888a65b1>]
:mds:mds_disconnect+0x121/0xe40
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c5ea4>]
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885cb834>]
:ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c8a79>]
:ptlrpc:lustre_pack_reply+0x29/0xb0
Oct 4 18:21:30 sn01 kernel: [<ffffffff88591381>]
:ptlrpc:target_handle_disconnect+0x461/0x560
Oct 4 18:21:30 sn01 kernel: [<ffffffff885c4f35>]
:ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct 4 18:21:30 sn01 kernel: [<ffffffff888a2324>]
:mds:mds_handle+0xd74/0x4cb0
Oct 4 18:21:30 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct 4 18:21:30 sn01 kernel: [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621
Oct 4 18:21:30 sn01 kernel: [<ffffffff885caa15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 4 18:21:30 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct 4 18:21:30 sn01 kernel: [<ffffffff885cf72d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d1e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 4 18:21:30 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct 4 18:21:30 sn01 kernel: [<ffffffff80088819>]
__wake_up_common+0x3e/0x68
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d5908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 4 18:21:30 sn01 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Oct 4 18:21:30 sn01 kernel: [<ffffffff800b48dd>]
audit_syscall_exit+0x327/0x342
Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 4 18:21:30 sn01 kernel: [<ffffffff885d46f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 4 18:21:30 sn01 kernel:
Oct 4 18:21:40 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s!
[ll_mdt_39:831]
Oct 4 18:21:40 sn01 kernel: CPU 5:
Oct 4 18:21:40 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U)
mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U)
osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)
autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U)
xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U)
scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U)
i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
parport_pc(U) lp(U) parport(U) sg(U) shpchp(U) netxen_nic(U) ide_cd(U)
serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U)
ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U)
mptbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:40 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:40 sn01 kernel: RIP: 0010:[<ffffffff88504b98>]
[<ffffffff88504b98>] :obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Oct 4 18:21:40 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000206
Oct 4 18:21:40 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000000543
RCX: 000000000000bf15
Oct 4 18:21:40 sn01 kernel: RDX: 000000000000036e RSI: ffffffff802f9e80
RDI: ffffc2001223243c
Oct 4 18:21:40 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0
R09: 0000000000000000
Oct 4 18:21:40 sn01 kernel: R10: ffff8100a59c1e00 R11: 0000000000000150
R12: ffff8100a5f6f800
Oct 4 18:21:40 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286
R15: ffff810123af79d0
Oct 4 18:21:40 sn01 kernel: FS: 00002ab6a6887230(0000)
GS:ffff810106751740(0000) knlGS:0000000000000000
Oct 4 18:21:40 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Oct 4 18:21:40 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000
CR4: 00000000000006e0
Oct 4 18:21:40 sn01 kernel:
Oct 4 18:21:40 sn01 kernel: [<ffffffff888a65b1>]
:mds:mds_disconnect+0x121/0xe40
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c5ea4>]
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885cb834>]
:ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c8a79>]
:ptlrpc:lustre_pack_reply+0x29/0xb0
Oct 4 18:21:40 sn01 kernel: [<ffffffff88591381>]
:ptlrpc:target_handle_disconnect+0x461/0x560
Oct 4 18:21:40 sn01 kernel: [<ffffffff885c4f35>]
:ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct 4 18:21:40 sn01 kernel: [<ffffffff888a2324>]
:mds:mds_handle+0xd74/0x4cb0
Oct 4 18:21:40 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct 4 18:21:40 sn01 kernel: [<ffffffff80088f32>]
find_busiest_group+0x20d/0x621
Oct 4 18:21:40 sn01 kernel: [<ffffffff885caa15>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 4 18:21:40 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct 4 18:21:40 sn01 kernel: [<ffffffff885cf72d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d1e67>]
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 4 18:21:40 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct 4 18:21:40 sn01 kernel: [<ffffffff80088819>]
__wake_up_common+0x3e/0x68
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d5908>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 4 18:21:40 sn01 kernel: [<ffffffff8008a3ef>]
default_wake_function+0x0/0xe
Oct 4 18:21:40 sn01 kernel: [<ffffffff800b48dd>]
audit_syscall_exit+0x327/0x342
Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 4 18:21:40 sn01 kernel: [<ffffffff885d46f0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 4 18:21:40 sn01 kernel:
s ago has timed out (limit 7s).
0/0 rc 0/0
will wait for recovery to complete.
Oct 4 18:21:48 sn01 kernel: Lustre: Skipped 1 previous similar message
Oct 4 18:21:48 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection
restored to service lustre_0-OST001c using nid 10.1.4.118 at tcp.
Oct 4 18:21:50 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s!
[ll_mdt_39:831]
Oct 4 18:21:50 sn01 kernel: CPU 5:
tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 4 18:21:50 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct 4 18:21:50 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>]
[<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct 4 18:21:50 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202
This repeats several times, before:
Oct 4 18:24:31 sn01 kernel: Lustre:
831:0:(service.c:1369:ptlrpc_server_handle_request()) @@@ Request
x44330119 took longer than estimated (100+90s); client may timeout.
req at ffff810188454c00 x44330119/t0
o39->e64e59dd-a66e-9210-ed51-5bcdebd7470d at NET_0x200000a010713_UUID:0/0
lens 128/128 e 0 to 0 dl 1254676981 ref 1 fl Complete:/0/0 rc 0/0
Oct 4 18:24:31 sn01 kernel: LustreError:
4417:0:(events.c:66:request_out_callback()) @@@ type 4, status -5
req at ffff81009432b800 x1310439138925892/t0
o400->lustre_0-OST001f_UUID at 10.1.4.118@tcp:28/4 lens 192/384 e 0 to 1
dl 1254676908 ref 1 fl Complete:XN/0/0 rc -11/0
Oct 4 18:24:31 sn01 kernel: LustreError:
4417:0:(events.c:66:request_out_callback()) Skipped 6 previous similar
messages
Oct 4 18:24:31 sn01 kernel: Lustre:
864:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre_0-MDT0000:
dbb6a3f0-9aca-15cb-c931-831d85f03526 reconnecting
Oct 4 18:24:31 sn01 kernel: Lustre:
885:0:(ldlm_lib.c:766:target_handle_connect()) lustre_0-MDT0000: exp
ffff8101888d4200 already connecting
Oct 4 18:24:31 sn01 kernel: LustreError:
885:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error
(-114) req at ffff8101ad1b6c00 x2087169/t0
o38->dbb6a3f0-9aca-15cb-c931-831d85f03526 at NET_0x200000a01070d_UUID:0/0
lens 304/200 e 0 to 0 dl 1254677171 ref 1 fl Interpret:/0/0 rc -114/0
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926075 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926074 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926073 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926133 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926135 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926136 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926134 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926184 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926186 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926185 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre:
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from
12345-10.1.4.118 at tcp portal 4 match 1310439138926187 offset 0 length 264: 2
Oct 4 18:24:31 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection
restored to service lustre_0-OST001c using nid 10.1.4.118 at tcp.
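For anyone trying to correlate these events across several clients and
servers, here is a rough Python sketch that pulls the timed-out-request
and ping-eviction lines out of a syslog excerpt. The regexes are my own
guesses at the message shapes shown above, not anything taken from the
Lustre source, so adjust them to your actual logs (note also that the
list archive munges "@" into " at " in NIDs).

```python
import re

# Message shapes guessed from the excerpts in this thread; real Lustre
# console messages may differ slightly between versions.
TIMEOUT_RE = re.compile(
    r"Request x(?P<xid>\d+) sent from (?P<target>\S+) to NID (?P<nid>\S+?)\s"
    r".*has timed out \(limit (?P<limit>\d+)s\)"
)
EVICT_RE = re.compile(
    r"haven't heard from client (?P<uuid>\S+) \(at (?P<nid>\S+?)\)"
    r" in (?P<secs>\d+) seconds"
)

def summarize(lines):
    """Split syslog lines into timed-out requests and ping evictions.

    Returns (timeouts, evictions) where each timeout is
    (target, nid, limit_seconds) and each eviction is
    (client_uuid, nid, silent_seconds).
    """
    timeouts, evictions = [], []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append(
                (m.group("target"), m.group("nid"), int(m.group("limit")))
            )
            continue
        m = EVICT_RE.search(line)
        if m:
            evictions.append(
                (m.group("uuid"), m.group("nid"), int(m.group("secs")))
            )
    return timeouts, evictions
```

Feeding it the lines above makes it easy to see which NIDs went quiet
just before the soft-lockup messages started.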
Chris