[Lustre-discuss] 1.8.0 Loosing connection to the MDT for several minutes and then recovering.

Lundgren, Andrew Andrew.Lundgren at Level3.com
Tue Oct 6 10:14:20 PDT 2009


We have a few 1.8.0 clusters running.  We have seen multiple instances now where the clients loose connectivity to the MDT.  The MDS logs indicate that there is some sort of problem on the MDT.

The following is a typical output:

Oct  6 02:56:08 mint1502 kernel: LustreError: 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID
Oct  6 02:56:08 mint1502 kernel:   req at ffff8100ac9f4000 x1314647461000449/t0 o37->7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID:0/0 lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
Oct  6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent from content-OST001d-osc to NID 207.123.49.16 at tcp 7s ago has timed out (limit 7s).
Oct  6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc: Connection to service content-OST001d via nid 207.123.49.16 at tcp was lost; in progress operations using this service will fail.
Oct  6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct  6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc: Connection to service content-OST001c via nid 207.123.49.16 at tcp was lost; in progress operations using this service will fail.
Oct  6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.16 at tcp was evicted due to a lock blocking callback to 207.123.49.16 at tcp timed out: rc -107
Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49.9 at tcp was evicted due to a lock blocking callback to 207.123.49.9 at tcp timed out: rc -107
Oct  6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_rdpg_04:28999]
Oct  6 02:56:18 mint1502 kernel: CPU 2:
Oct  6 02:56:18 mint1502 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Oct  6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted: G      2.6.18-92.1.17.el5_lustre.1.8.0smp #1
Oct  6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>]  [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
Oct  6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850  EFLAGS: 00000206
Oct  6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: 000000000000328c RCX: 000000000000ba75
Oct  6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: ffffffff802f0d80 RDI: ffffc200109d78cc
Oct  6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: ffff810001016e60 R09: 0000000000000000
Oct  6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: 0000000000000150 R12: ffff810448dfd200
Oct  6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff8104c02ad7f0
Oct  6 02:56:18 mint1502 kernel: FS:  00002b49ff256220(0000) GS:ffff81051fc53d40(0000) knlGS:0000000000000000
Oct  6 02:56:18 mint1502 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: 0000000000201000 CR4: 00000000000006e0
Oct  6 02:56:18 mint1502 kernel:
Oct  6 02:56:18 mint1502 kernel: Call Trace:
Oct  6 02:56:18 mint1502 kernel:  [<ffffffff885962b8>] :obdclass:class_disconnect+0x378/0x400
Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8860e590>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8891b301>] :mds:mds_disconnect+0x121/0xe30
Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8004a824>] try_to_del_timer_sync+0x51/0x5a
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff885920e4>] :obdclass:class_fail_export+0x384/0x4c0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff889150d1>] :mds:mds_readpage+0x1571/0x18a0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886538d5>] :ptlrpc:lustre_msg_set_limit+0x35/0xf0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff88918470>] :mds:mds_handle+0x2140/0x4c80
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886539c5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865867d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865ac03>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865e138>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865cf20>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct  6 02:56:19 mint1502 kernel:

Those go on for a while until it recovers:

Oct  6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct  6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent from content-OST000a-osc to NID 207.123.49.37 at tcp 26s ago has timed out (limit 26s).
Oct  6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar messages
Oct  6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried all connections, increasing latency to 26s
Oct  6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar messages
Oct  6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37 at tcp) in 228 seconds. I think it's dead, and I am evicting it.
Oct  6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages
Oct  6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000a_UUID at 207.123.49.37@tcp changed server handle from 0x25a7c6432e7fffec to 0x25a7c643369c7038
Oct  6 03:04:02 mint1502 kernel: but is still in recovery
Oct  6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was evicted by content-OST000a; in progress operations using this service will fail.
Oct  6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct  6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) content-OST000b_UUID at 207.123.49.37@tcp changed server handle from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
Oct  6 03:04:02 mint1502 kernel: but is still in recovery
Oct  6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are active, abort quota recovery
Oct  6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous similar message
Oct  6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection restored to service content-OST000a using nid 207.123.49.37 at tcp.
Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
Oct  6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: content-OST000a_UUID now active, resetting orphans
Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message

Anyone seen this before?

--
Andrew

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20091006/491037b1/attachment.htm>


More information about the lustre-discuss mailing list