[Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.

Đào Thị Thảo thaodt at isds.vn
Tue Oct 6 20:43:27 PDT 2009


hi,
I have the same problem with Lundgren.
I don't understand why it's happen. While, my network is still stable.
This problem is repeated. Some time, client can't connection restored to
service (OSTs or MDS) . My  provisional measure is rebooting the node.
Andrew, can you explain more detail and guide how to fix it?

On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew <
Andrew.Lundgren at level3.com> wrote:

>  Oh man, that should have read LOSING!
>
>
>
> *From:* lustre-discuss-bounces at lists.lustre.org [mailto:
> lustre-discuss-bounces at lists.lustre.org] *On Behalf Of *Lundgren, Andrew
> *Sent:* Tuesday, October 06, 2009 11:14 AM
> *To:* lustre-discuss
> *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for
> several minutes and then recovering.
>
>
>
> We have a few 1.8.0 clusters running.  We have seen multiple instances now
> where the clients loose connectivity to the MDT.  The MDS logs indicate that
> there is some sort of problem on the MDT.
>
>
>
> The following is a typical output:
>
>
>
> Oct  6 02:56:08 mint1502 kernel: LustreError:
> 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096),
> evicting 7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID
>
> Oct  6 02:56:08 mint1502 kernel:   req at ffff8100ac9f4000x1314647461000449/t0
> o37->7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID:0/0
> lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
>
> Oct  6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent
> from content-OST001d-osc to NID 207.123.49.16 at tcp 7s ago has timed out
> (limit 7s).
>
> Oct  6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc:
> Connection to service content-OST001d via nid 207.123.49.16 at tcp was lost;
> in progress operations using this service will fail.
>
> Oct  6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct  6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc:
> Connection to service content-OST001c via nid 207.123.49.16 at tcp was lost;
> in progress operations using this service will fail.
>
> Oct  6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A
> client on nid 207.123.49.16 at tcp was evicted due to a lock blocking
> callback to 207.123.49.16 at tcp timed out: rc -107
>
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A
> client on nid 207.123.49.9 at tcp was evicted due to a lock blocking callback
> to 207.123.49.9 at tcp timed out: rc -107
>
> Oct  6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
> [ll_mdt_rdpg_04:28999]
>
> Oct  6 02:56:18 mint1502 kernel: CPU 2:
>
> Oct  6 02:56:18 mint1502 kernel: Modules linked in: mds(U)
> fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U)
> lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)
> ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U)
> ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U)
> ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U)
> dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U)
> i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
> parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U)
> i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U)
> serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U)
> dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U)
> libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U)
> scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
>
> Oct  6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted:
> G      2.6.18-92.1.17.el5_lustre.1.8.0smp #1
>
> Oct  6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>]
> [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
>
> Oct  6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850  EFLAGS:
> 00000206
>
> Oct  6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX:
> 000000000000328c RCX: 000000000000ba75
>
> Oct  6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI:
> ffffffff802f0d80 RDI: ffffc200109d78cc
>
> Oct  6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08:
> ffff810001016e60 R09: 0000000000000000
>
> Oct  6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11:
> 0000000000000150 R12: ffff810448dfd200
>
> Oct  6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14:
> 0000000000000286 R15: ffff8104c02ad7f0
>
> Oct  6 02:56:18 mint1502 kernel: FS:  00002b49ff256220(0000)
> GS:ffff81051fc53d40(0000) knlGS:0000000000000000
>
> Oct  6 02:56:18 mint1502 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
>
> Oct  6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3:
> 0000000000201000 CR4: 00000000000006e0
>
> Oct  6 02:56:18 mint1502 kernel:
>
> Oct  6 02:56:18 mint1502 kernel: Call Trace:
>
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff885962b8>]
> :obdclass:class_disconnect+0x378/0x400
>
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8860e590>]
> :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
>
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8891b301>]
> :mds:mds_disconnect+0x121/0xe30
>
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8003d514>]
> lock_timer_base+0x1b/0x3c
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8004a824>]
> try_to_del_timer_sync+0x51/0x5a
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff885920e4>]
> :obdclass:class_fail_export+0x384/0x4c0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff889150d1>]
> :mds:mds_readpage+0x1571/0x18a0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886538d5>]
> :ptlrpc:lustre_msg_set_limit+0x35/0xf0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>]
> default_wake_function+0x0/0xe
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff88918470>]
> :mds:mds_handle+0x2140/0x4c80
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886539c5>]
> :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865867d>]
> :ptlrpc:ptlrpc_check_req+0x1d/0x110
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865ac03>]
> :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80062f4b>]
> thread_return+0x0/0xdf
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800891f6>]
> __wake_up_common+0x3e/0x68
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865e138>]
> :ptlrpc:ptlrpc_main+0x1218/0x13e0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>]
> default_wake_function+0x0/0xe
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800b4382>]
> audit_syscall_exit+0x31b/0x336
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865cf20>]
> :ptlrpc:ptlrpc_main+0x0/0x13e0
>
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Oct  6 02:56:19 mint1502 kernel:
>
>
>
> Those go on for a while until it recovers:
>
>
>
> Oct  6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct  6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent
> from content-OST000a-osc to NID 207.123.49.37 at tcp 26s ago has timed out
> (limit 26s).
>
> Oct  6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar
> messages
>
> Oct  6 03:03:12 mint1502 kernel: Lustre:
> 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried
> all connections, increasing latency to 26s
>
> Oct  6 03:03:12 mint1502 kernel: Lustre:
> 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar
> messages
>
> Oct  6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client
> e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37 at tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
> Oct  6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar
> messages
>
> Oct  6 03:04:02 mint1502 kernel: Lustre:
> 4712:0:(import.c:837:ptlrpc_connect_interpret())
> content-OST000a_UUID at 207.123.49.37@tcp changed server handle from
> 0x25a7c6432e7fffec to 0x25a7c643369c7038
>
> Oct  6 03:04:02 mint1502 kernel: but is still in recovery
>
> Oct  6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was
> evicted by content-OST000a; in progress operations using this service will
> fail.
>
> Oct  6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar
> message
>
> Oct  6 03:04:02 mint1502 kernel: Lustre:
> 4712:0:(import.c:837:ptlrpc_connect_interpret())
> content-OST000b_UUID at 207.123.49.37@tcp changed server handle from
> 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
>
> Oct  6 03:04:02 mint1502 kernel: but is still in recovery
>
> Oct  6 03:04:02 mint1502 kernel: Lustre:
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are
> active, abort quota recovery
>
> Oct  6 03:04:02 mint1502 kernel: Lustre:
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous
> similar message
>
> Oct  6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection
> restored to service content-OST000a using nid 207.123.49.37 at tcp.
>
> Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
>
> Oct  6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000:
> content-OST000a_UUID now active, resetting orphans
>
> Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
>
>
>
> Anyone seen this before?
>
>
>
> --
>
> Andrew
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091007/ad3035e5/attachment.htm>


More information about the lustre-discuss mailing list