[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.

Lundgren, Andrew Andrew.Lundgren at Level3.com
Tue Oct 6 22:13:37 PDT 2009


I haven’t done anything yet.  The machines seem to reconnect without intervention.   But the problem occurs again later.
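If a client never reconnects on its own, it may be worth confirming whether its imports are actually stuck before rebooting, and then forcing a remount. A rough sketch of what that could look like on an affected client (the mount point, filesystem name, and MGS NID below are placeholders for your own setup):

  # Check which targets the client currently sees as unreachable
  lfs check servers
  lctl dl

  # Recent Lustre messages on the client
  dmesg | grep -i lustre | tail -n 50

  # If an import never recovers, a forced unmount/remount re-establishes
  # the connections without taking the whole node down
  umount -f /mnt/lustre                                # placeholder mount point
  mount -t lustre <mgs-nid>@tcp:/content /mnt/lustre   # placeholder MGS NID

A forced remount is essentially what the reboot accomplishes on the client side, without taking the whole node down.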

From: Đào Thị Thảo [mailto:thaodt at isds.vn]
Sent: Tuesday, October 06, 2009 9:43 PM
To: Lundgren, Andrew
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.

Hi,
I am seeing the same problem as Andrew describes, and I don't understand why it happens, since my network is stable.
The problem keeps recurring. Sometimes a client never gets its connection restored to the service (OSTs or MDS); my provisional workaround is to reboot the node.
Andrew, can you explain in more detail and advise how to fix it?
On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew <Andrew.Lundgren at level3.com> wrote:

Oh man, that should have read LOSING!



From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Lundgren, Andrew
Sent: Tuesday, October 06, 2009 11:14 AM
To: lustre-discuss
Subject: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.



We have a few 1.8.0 clusters running.  We have seen multiple instances now where the clients lose connectivity to the MDT.  The MDS logs indicate that there is some sort of problem on the MDT.
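When it recurs, something along these lines could be run on the MDS to record the timeouts in effect and capture a debug dump for later analysis (a rough sketch; the proc paths assume a stock 1.8.0 layout):

  # RPC timeout and adaptive-timeout bounds currently in effect
  cat /proc/sys/lustre/timeout
  cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max

  # Dump the in-kernel Lustre debug buffer to a file
  lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)

  # Quick scan of recent evictions and timeouts in syslog
  grep -E 'evict|timed out' /var/log/messages | tail -n 100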



The following is a typical output:



Oct  6 02:56:08 mint1502 kernel: LustreError: 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 7523f416-2579-5f49-cd3f-54d2438b84bb@NET_0x20000ce213b0b_UUID

Oct  6 02:56:08 mint1502 kernel:   req@ffff8100ac9f4000 x1314647461000449/t0 o37->7523f416-2579-5f49-cd3f-54d2438b84bb@NET_0x20000ce213b0b_UUID:0/0 lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0

Oct  6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent from content-OST001d-osc to NID 207.123.49.16@tcp 7s ago has timed out (limit 7s).

Oct  6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc: Connection to service content-OST001d via nid 207.123.49.16@tcp was lost; in progress operations using this service will fail.

Oct  6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar message

Oct  6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc: Connection to service content-OST001c via nid 207.123.49.16@tcp was lost; in progress operations using this service will fail.

Oct  6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar message

Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.16@tcp was evicted due to a lock blocking callback to 207.123.49.16@tcp timed out: rc -107

Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.9@tcp was evicted due to a lock blocking callback to 207.123.49.9@tcp timed out: rc -107

Oct  6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_rdpg_04:28999]

Oct  6 02:56:18 mint1502 kernel: CPU 2:

Oct  6 02:56:18 mint1502 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)

Oct  6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted: G      2.6.18-92.1.17.el5_lustre.1.8.0smp #1

Oct  6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>]  [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290

Oct  6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850  EFLAGS: 00000206

Oct  6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: 000000000000328c RCX: 000000000000ba75

Oct  6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: ffffffff802f0d80 RDI: ffffc200109d78cc

Oct  6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: ffff810001016e60 R09: 0000000000000000

Oct  6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: 0000000000000150 R12: ffff810448dfd200

Oct  6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff8104c02ad7f0

Oct  6 02:56:18 mint1502 kernel: FS:  00002b49ff256220(0000) GS:ffff81051fc53d40(0000) knlGS:0000000000000000

Oct  6 02:56:18 mint1502 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

Oct  6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: 0000000000201000 CR4: 00000000000006e0

Oct  6 02:56:18 mint1502 kernel:

Oct  6 02:56:18 mint1502 kernel: Call Trace:

Oct  6 02:56:18 mint1502 kernel:  [<ffffffff885962b8>] :obdclass:class_disconnect+0x378/0x400

Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8860e590>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0

Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8891b301>] :mds:mds_disconnect+0x121/0xe30

Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8004a824>] try_to_del_timer_sync+0x51/0x5a

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff885920e4>] :obdclass:class_fail_export+0x384/0x4c0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff889150d1>] :mds:mds_readpage+0x1571/0x18a0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886538d5>] :ptlrpc:lustre_msg_set_limit+0x35/0xf0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff88918470>] :mds:mds_handle+0x2140/0x4c80

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886539c5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865867d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865ac03>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800891f6>] __wake_up_common+0x3e/0x68

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865e138>] :ptlrpc:ptlrpc_main+0x1218/0x13e0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865cf20>] :ptlrpc:ptlrpc_main+0x0/0x13e0

Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Oct  6 02:56:19 mint1502 kernel:



Those go on for a while until it recovers:



Oct  6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar message

Oct  6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent from content-OST000a-osc to NID 207.123.49.37@tcp 26s ago has timed out (limit 26s).

Oct  6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar messages

Oct  6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried all connections, increasing latency to 26s

Oct  6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar messages

Oct  6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37@tcp) in 228 seconds. I think it's dead, and I am evicting it.

Oct  6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages

Oct  6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000a_UUID@207.123.49.37@tcp changed server handle from 0x25a7c6432e7fffec to 0x25a7c643369c7038

Oct  6 03:04:02 mint1502 kernel: but is still in recovery

Oct  6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was evicted by content-OST000a; in progress operations using this service will fail.

Oct  6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar message

Oct  6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000b_UUID@207.123.49.37@tcp changed server handle from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f

Oct  6 03:04:02 mint1502 kernel: but is still in recovery

Oct  6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are active, abort quota recovery

Oct  6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous similar message

Oct  6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection restored to service content-OST000a using nid 207.123.49.37@tcp.

Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message

Oct  6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: content-OST000a_UUID now active, resetting orphans

Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message



Anyone seen this before?



--

Andrew



_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

