[Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.

Christopher J. Walker C.J.Walker at qmul.ac.uk
Fri Oct 9 03:25:50 PDT 2009


Lundgren, Andrew wrote:
> I haven’t done anything yet.  The machines seem to reconnect without 
> intervention.   But the problem occurs again later.
> 
>  
> 
> *From:* Đào Thị Thảo [mailto:thaodt at isds.vn]
> *Sent:* Tuesday, October 06, 2009 9:43 PM
> *To:* Lundgren, Andrew
> *Cc:* lustre-discuss
> *Subject:* Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for 
> several minutes and then recovering.
> 
>  
> 
> hi,
> I have the same problem as Andrew.
> I don't understand why it happens, since my network is stable. The
> problem recurs, and sometimes the client never gets its connection to
> the service (OSTs or MDS) restored. My provisional workaround is to
> reboot the node.
> Andrew, can you explain in more detail and suggest how to fix it?
> 
> On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew 
> <Andrew.Lundgren at level3.com <mailto:Andrew.Lundgren at level3.com>> wrote:
> 
> Oh man, that should have read LOSING! 
> 
>  
> 
> *From:* lustre-discuss-bounces at lists.lustre.org 
> <mailto:lustre-discuss-bounces at lists.lustre.org> 
> [mailto:lustre-discuss-bounces at lists.lustre.org 
> <mailto:lustre-discuss-bounces at lists.lustre.org>] *On Behalf Of 
> *Lundgren, Andrew
> *Sent:* Tuesday, October 06, 2009 11:14 AM
> *To:* lustre-discuss
> *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for 
> several minutes and then recovering.
> 
>  
> 
> We have a few 1.8.0 clusters running.  We have seen multiple instances 
> now where the clients lose connectivity to the MDT.  The MDS logs 
> indicate that there is some sort of problem on the MDT.
> 
>  
> 
> The following is a typical output:
> 
>  
> 
> Oct  6 02:56:08 mint1502 kernel: LustreError: 
> 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), 
> evicting 7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID
> 
> Oct  6 02:56:08 mint1502 kernel:   req at ffff8100ac9f4000 
> x1314647461000449/t0 
> o37->7523f416-2579-5f49-cd3f-54d2438b84bb at NET_0x20000ce213b0b_UUID:0/0 
> lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
> 
> Oct  6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent 
> from content-OST001d-osc to NID 207.123.49.16 at tcp 7s ago has timed out 
> (limit 7s).
> 
> Oct  6 02:56:16 mint1502 kernel: LustreError: 166-1: 
> content-OST001d-osc: Connection to service content-OST001d via nid 
> 207.123.49.16 at tcp was lost; in progress operations using this service 
> will fail.
> 
> Oct  6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 02:56:17 mint1502 kernel: LustreError: 166-1: 
> content-OST001c-osc: Connection to service content-OST001c via nid 
> 207.123.49.16 at tcp was lost; in progress operations using this service 
> will fail.
> 
> Oct  6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A 
> client on nid 207.123.49.16 at tcp was evicted due to a lock blocking 
> callback to 207.123.49.16 at tcp timed out: rc -107
> 
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A 
> client on nid 207.123.49.9 at tcp was evicted due to a lock blocking 
> callback to 207.123.49.9 at tcp timed out: rc -107
> 
> Oct  6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! 
> [ll_mdt_rdpg_04:28999]
> 
> Oct  6 02:56:18 mint1502 kernel: CPU 2:
> 
> Oct  6 02:56:18 mint1502 kernel: Modules linked in: mds(U) 
> fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) 
> mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) 
> lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) 
> bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) 
> xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) 
> x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) 
> backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) 
> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) 
> i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) 
> pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) 
> dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) 
> scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) 
> mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) 
> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> 
> Oct  6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 
> Tainted: G      2.6.18-92.1.17.el5_lustre.1.8.0smp #1
> 
> Oct  6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>]  
> [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
> 
> Oct  6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850  EFLAGS: 
> 00000206
> 
> Oct  6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: 
> 000000000000328c RCX: 000000000000ba75
> 
> Oct  6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: 
> ffffffff802f0d80 RDI: ffffc200109d78cc
> 
> Oct  6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: 
> ffff810001016e60 R09: 0000000000000000
> 
> Oct  6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: 
> 0000000000000150 R12: ffff810448dfd200
> 
> Oct  6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: 
> 0000000000000286 R15: ffff8104c02ad7f0
> 
> Oct  6 02:56:18 mint1502 kernel: FS:  00002b49ff256220(0000) 
> GS:ffff81051fc53d40(0000) knlGS:0000000000000000
> 
> Oct  6 02:56:18 mint1502 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 000000008005003b
> 
> Oct  6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: 
> 0000000000201000 CR4: 00000000000006e0
> 
> Oct  6 02:56:18 mint1502 kernel:
> 
> Oct  6 02:56:18 mint1502 kernel: Call Trace:
> 
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff885962b8>] 
> :obdclass:class_disconnect+0x378/0x400
> 
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8860e590>] 
> :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
> 
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8891b301>] 
> :mds:mds_disconnect+0x121/0xe30
> 
> Oct  6 02:56:18 mint1502 kernel:  [<ffffffff8003d514>] 
> lock_timer_base+0x1b/0x3c
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8004a824>] 
> try_to_del_timer_sync+0x51/0x5a
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff885920e4>] 
> :obdclass:class_fail_export+0x384/0x4c0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff889150d1>] 
> :mds:mds_readpage+0x1571/0x18a0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886538d5>] 
> :ptlrpc:lustre_msg_set_limit+0x35/0xf0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] 
> default_wake_function+0x0/0xe
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff88918470>] 
> :mds:mds_handle+0x2140/0x4c80
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80143809>] __next_cpu+0x19/0x28
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff886539c5>] 
> :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865867d>] 
> :ptlrpc:ptlrpc_check_req+0x1d/0x110
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865ac03>] 
> :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff80062f4b>] 
> thread_return+0x0/0xdf
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800891f6>] 
> __wake_up_common+0x3e/0x68
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865e138>] 
> :ptlrpc:ptlrpc_main+0x1218/0x13e0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8008abb9>] 
> default_wake_function+0x0/0xe
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff800b4382>] 
> audit_syscall_exit+0x31b/0x336
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8865cf20>] 
> :ptlrpc:ptlrpc_main+0x0/0x13e0
> 
> Oct  6 02:56:19 mint1502 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> 
> Oct  6 02:56:19 mint1502 kernel:
> 
>  
> 
> Those go on for a while until it recovers:
> 
>  
> 
> Oct  6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent 
> from content-OST000a-osc to NID 207.123.49.37 at tcp 26s ago has timed out 
> (limit 26s).
> 
> Oct  6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar 
> messages
> 
> Oct  6 03:03:12 mint1502 kernel: Lustre: 
> 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: 
> tried all connections, increasing latency to 26s
> 
> Oct  6 03:03:12 mint1502 kernel: Lustre: 
> 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous 
> similar messages
> 
> Oct  6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client 
> e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49.37 at tcp) in 228 
> seconds. I think it's dead, and I am evicting it.
> 
> Oct  6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: 
> 4712:0:(import.c:837:ptlrpc_connect_interpret()) 
> content-OST000a_UUID at 207.123.49.37 
> <mailto:content-OST000a_UUID at 207.123.49.37>@tcp changed server handle 
> from 0x25a7c6432e7fffec to 0x25a7c643369c7038
> 
> Oct  6 03:04:02 mint1502 kernel: but is still in recovery
> 
> Oct  6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was 
> evicted by content-OST000a; in progress operations using this service 
> will fail.
> 
> Oct  6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: 
> 4712:0:(import.c:837:ptlrpc_connect_interpret()) 
> content-OST000b_UUID at 207.123.49.37 
> <mailto:content-OST000b_UUID at 207.123.49.37>@tcp changed server handle 
> from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
> 
> Oct  6 03:04:02 mint1502 kernel: but is still in recovery
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: 
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are 
> active, abort quota recovery
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: 
> 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous 
> similar message
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection 
> restored to service content-OST000a using nid 207.123.49.37 at tcp.
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: 
> content-OST000a_UUID now active, resetting orphans
> 
> Oct  6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
> 
>  

I've just seen it with a 1.8.1 server (and 1.6 clients):

Oct  4 18:21:30 sn01 kernel: CPU 5:
Oct  4 18:21:30 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) 
mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) 
osc(U) ksocklnd(U) ptlrpc(U) ob
dclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) 
l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) 
dm_mirror(U) dm_log(U) dm_multipath(U)
scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) 
i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) 
parport_pc(U) lp(U) parport(U) s
g(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) 
pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) 
mptscsih(U) scsi_transport_sas(U) mp
tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) 
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct  4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct  4 18:21:30 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] 
[<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct  4 18:21:30 sn01 kernel: RSP: 0018:ffff810123af7a30  EFLAGS: 00000202
Oct  4 18:21:30 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000002662 
RCX: 00000000000083dc
Oct  4 18:21:30 sn01 kernel: RDX: 00000000000002dc RSI: ffffffff802f9e80 
RDI: ffffc2001225361c
Oct  4 18:21:30 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 
R09: 0000000000000000
Oct  4 18:21:30 sn01 kernel: R10: ffff8100aa9d5cc0 R11: 0000000000000150 
R12: ffff8100acf36800
Oct  4 18:21:30 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 
R15: ffff810123af79d0
Oct  4 18:21:30 sn01 kernel: FS:  00002ab6a6887230(0000) 
GS:ffff810106751740(0000) knlGS:0000000000000000
Oct  4 18:21:30 sn01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Oct  4 18:21:30 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 
CR4: 00000000000006e0
Oct  4 18:21:30 sn01 kernel:
Oct  4 18:21:30 sn01 kernel: Call Trace:
Oct  4 18:21:30 sn01 kernel:  [<ffffffff8850ba88>] 
:obdclass:class_disconnect+0x398/0x420
Oct  4 18:21:30 sn01 kernel:  [<ffffffff888a65b1>] 
:mds:mds_disconnect+0x121/0xe40
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885c5ea4>] 
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885cb834>] 
:ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885c8a79>] 
:ptlrpc:lustre_pack_reply+0x29/0xb0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff88591381>] 
:ptlrpc:target_handle_disconnect+0x461/0x560
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885c4f35>] 
:ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff888a2324>] 
:mds:mds_handle+0xd74/0x4cb0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct  4 18:21:30 sn01 kernel:  [<ffffffff80088f32>] 
find_busiest_group+0x20d/0x621
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885caa15>] 
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885cf72d>] 
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885d1e67>] 
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  4 18:21:30 sn01 kernel:  [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct  4 18:21:30 sn01 kernel:  [<ffffffff80088819>] 
__wake_up_common+0x3e/0x68
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885d5908>] 
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff8008a3ef>] 
default_wake_function+0x0/0xe
Oct  4 18:21:30 sn01 kernel:  [<ffffffff800b48dd>] 
audit_syscall_exit+0x327/0x342
Oct  4 18:21:30 sn01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct  4 18:21:30 sn01 kernel:  [<ffffffff885d46f0>] 
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  4 18:21:30 sn01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct  4 18:21:30 sn01 kernel:
Oct  4 18:21:40 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! 
[ll_mdt_39:831]
Oct  4 18:21:40 sn01 kernel: CPU 5:
Oct  4 18:21:40 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) 
mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) 
osc(U) ksocklnd(U) ptlrpc(U) ob
dclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) 
l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) 
dm_mirror(U) dm_log(U) dm_multipath(U)
scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) 
i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) 
parport_pc(U) lp(U) parport(U) s
g(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) 
pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) 
mptscsih(U) scsi_transport_sas(U) mp
tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) 
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct  4 18:21:40 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct  4 18:21:40 sn01 kernel: RIP: 0010:[<ffffffff88504b98>] 
[<ffffffff88504b98>] :obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Oct  4 18:21:40 sn01 kernel: RSP: 0018:ffff810123af7a30  EFLAGS: 00000206
Oct  4 18:21:40 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000000543 
RCX: 000000000000bf15
Oct  4 18:21:40 sn01 kernel: RDX: 000000000000036e RSI: ffffffff802f9e80 
RDI: ffffc2001223243c
Oct  4 18:21:40 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 
R09: 0000000000000000
Oct  4 18:21:40 sn01 kernel: R10: ffff8100a59c1e00 R11: 0000000000000150 
R12: ffff8100a5f6f800
Oct  4 18:21:40 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 
R15: ffff810123af79d0
Oct  4 18:21:40 sn01 kernel: FS:  00002ab6a6887230(0000) 
GS:ffff810106751740(0000) knlGS:0000000000000000
Oct  4 18:21:40 sn01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Oct  4 18:21:40 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 
CR4: 00000000000006e0
Oct  4 18:21:40 sn01 kernel:
Oct  4 18:21:40 sn01 kernel:  [<ffffffff888a65b1>] 
:mds:mds_disconnect+0x121/0xe40
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885c5ea4>] 
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885cb834>] 
:ptlrpc:lustre_msg_set_timeout+0x34/0x110
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885c8a79>] 
:ptlrpc:lustre_pack_reply+0x29/0xb0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff88591381>] 
:ptlrpc:target_handle_disconnect+0x461/0x560
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885c4f35>] 
:ptlrpc:lustre_msg_get_opc+0x35/0xf0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff888a2324>] 
:mds:mds_handle+0xd74/0x4cb0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Oct  4 18:21:40 sn01 kernel:  [<ffffffff80088f32>] 
find_busiest_group+0x20d/0x621
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885caa15>] 
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff80089d89>] enqueue_task+0x41/0x56
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885cf72d>] 
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885d1e67>] 
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  4 18:21:40 sn01 kernel:  [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Oct  4 18:21:40 sn01 kernel:  [<ffffffff80088819>] 
__wake_up_common+0x3e/0x68
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885d5908>] 
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff8008a3ef>] 
default_wake_function+0x0/0xe
Oct  4 18:21:40 sn01 kernel:  [<ffffffff800b48dd>] 
audit_syscall_exit+0x327/0x342
Oct  4 18:21:40 sn01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct  4 18:21:40 sn01 kernel:  [<ffffffff885d46f0>] 
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  4 18:21:40 sn01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct  4 18:21:40 sn01 kernel:
s ago has timed out (limit 7s).
0/0 rc 0/0
  will wait for recovery to complete.
Oct  4 18:21:48 sn01 kernel: Lustre: Skipped 1 previous similar message
Oct  4 18:21:48 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection 
restored to service lustre_0-OST001c using nid 10.1.4.118 at tcp.
Oct  4 18:21:50 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! 
[ll_mdt_39:831]
Oct  4 18:21:50 sn01 kernel: CPU 5:
tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) 
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct  4 18:21:50 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 
2.6.18-128.1.14.el5_lustre.1.8.1 #1
Oct  4 18:21:50 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] 
[<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0
Oct  4 18:21:50 sn01 kernel: RSP: 0018:ffff810123af7a30  EFLAGS: 00000202


This repeats several times, before:
Oct  4 18:24:31 sn01 kernel: Lustre: 
831:0:(service.c:1369:ptlrpc_server_handle_request()) @@@ Request 
x44330119 took longer than estimated (100+90s); client may timeout.
req at ffff810188454c00 x44330119/t0 
o39->e64e59dd-a66e-9210-ed51-5bcdebd7470d at NET_0x200000a010713_UUID:0/0 
lens 128/128 e 0 to 0 dl 1254676981 ref 1 fl Complete:/0/0 rc 0/0
Oct  4 18:24:31 sn01 kernel: LustreError: 
4417:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 
req at ffff81009432b800 x1310439138925892/t0 o400->lustre_0-OST001
f_UUID at 10.1.4.118@tcp:28/4 lens 192/384 e 0 to 1 dl 1254676908 ref 1 fl 
Complete:XN/0/0 rc -11/0
Oct  4 18:24:31 sn01 kernel: LustreError: 
4417:0:(events.c:66:request_out_callback()) Skipped 6 previous similar 
messages
Oct  4 18:24:31 sn01 kernel: Lustre: 
864:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre_0-MDT0000: 
dbb6a3f0-9aca-15cb-c931-831d85f03526 reconnecting
Oct  4 18:24:31 sn01 kernel: Lustre: 
885:0:(ldlm_lib.c:766:target_handle_connect()) lustre_0-MDT0000: exp 
ffff8101888d4200 already connecting
Oct  4 18:24:31 sn01 kernel: LustreError: 
885:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error 
(-114)  req at ffff8101ad1b6c00 x2087169/t0 o38->dbb6a3f0-9aca-1
5cb-c931-831d85f03526 at NET_0x200000a01070d_UUID:0/0 lens 304/200 e 0 to 0 
dl 1254677171 ref 1 fl Interpret:/0/0 rc -114/0
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926075 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926074 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926073 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926133 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926135 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926136 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926134 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926184 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926186 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926185 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: 
4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 
12345-10.1.4.118 at tcp portal 4 match 1310439138926187 offset 0 length 264: 2
Oct  4 18:24:31 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection 
restored to service lustre_0-OST001c using nid 10.1.4.118 at tcp.

Chris
