[Lustre-discuss] Rx failures

Ulrich Sibiller u.sibiller+lustre_discuss at science-computing.de
Tue Feb 2 08:10:55 PST 2010


Hi,

we are experiencing some weird behaviour on one of our Lustre clients.

First some information about our environment:
- Lustre 1.8.1.1 CentOS52 (Kernel 2.6.18-92.1.13.el5)
- self-compiled patchless clients with quotas enabled (quotas not used at the moment)
- Infiniband interconnect, OFED 1.3.3
- OSS: 2x Sun Fire X4540 with 48TB, with official sun kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1, 4
OSTs on
each of them, OFED 1.3.3
- MDS: 2x Sun Fire X4100 M2 + 1x StorEdge 3320 with heartbeat failover, official sun kernel
2.6.18-128.7.1.el5_lustre.1.8.1.1, OFED 1.4.2
- Lustre mounted on /hpcscr
- OSS1 is 192.168.60.238 at o2ib, hostname is hpc9oss1
- OSS2 is 192.168.60.237 at o2ib, hostname is hpc9oss2
- MDS1 is 192.168.60.240 at o2ib, hostname is hpc9mds1 (active)
- MDS2 is 192.168.60.239 at o2ib, hostname is hpc9mds2 (standby, was active for a short time while
mds1 was lifted from 1.6.7.1 to 1.8.1.1)
- problematic client is 192.168.60.226 at o2ib, hostname hpc9master02
- no problems on the Infiniband

Problem:
Users report a slow Lustre filesystem on this particular machine (hpc9master02). Running "find
/hpcscr -ls" gets stuck after some time and most of the time it continues after some seconds, but
sometimes it takes several minutes and sometimes I get errors (one "I/O error", then several "Cannot
send after transport endpoint shutdown") and the find terminates. The IB error counters do not change
during this test.

This is the client syslog during a find run that aborted with the I/O error (see below for MDS and
OSS syslogs):
The output is shortened and commented, I can provide the complete logs if necessary.
--------------------------------------------
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 last message repeated 11 times
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5946:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.240 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256, msg_size: 4096

Then for 192.168.60.238 at o2ib:
Feb  2 15:59:02 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.238 at o2ib failed: 5
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.238 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256, msg_size: 4096

Then for 192.168.60.237 at o2ib:
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5942:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5942:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.237 at o2ib failed: 5
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.237 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256, msg_size: 4096

Again for 192.168.60.240 at o2ib:
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5947:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
192.168.60.240 at o2ib failed: 5
Feb  2 15:59:03 hpc9master02 kernel: Lustre: 5157:0:(o2iblnd_cb.c:2384:kiblnd_reconnect())
192.168.60.240 at o2ib: retrying (stale), 12, 12, queue_dep: 8, max_frag: 256, msg_size: 4096

And then the connection gets closed:
Feb  2 16:00:17 hpc9master02 kernel: Lustre: 5151:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.240 at o2ib: error 0(waiting)
Feb  2 16:00:17 hpc9master02 kernel: LustreError: 5945:0:(events.c:66:request_out_callback()) @@@
type 4, status -5  req at ffff811f83619000 x1326214018219145/t0
o101->cfd1-MDT0000_UUID at 192.168.60.240@o2ib:12/10 lens 512/1232 e 0 t
o 1 dl 1265122824 ref 2 fl Rpc:P/0/0 rc 0/0

Afterwards it all starts over again until these messages appear:
Feb  2 16:02:54 hpc9master02 kernel: Lustre: 29726:0:(client.c:1383:ptlrpc_expire_one_request()) @@@
Request x1326214018348828 sent from cfd1-MDT0000-mdc-ffff811026fc1400 to NID 192.168.60.240 at o2ib 7s
ago has timed out (limit 7s).
Feb  2 16:02:54 hpc9master02 kernel:   req at ffff811ec8cb2c00 x1326214018348828/t0
o101->cfd1-MDT0000_UUID at 192.168.60.240@o2ib:12/10 lens 568/1232 e 0 to 1 dl 1265122974 ref 1 fl
Rpc:/0/0 rc 0/0
Feb  2 16:02:54 hpc9master02 kernel: Lustre: 29726:0:(client.c:1383:ptlrpc_expire_one_request())
Skipped 5 previous similar messages
Feb  2 16:02:54 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff811026fc1400: Connection to
service cfd1-MDT0000 via nid 192.168.60.240 at o2ib was lost; in progress operations using this service
will wait for recovery to complete.

Followed again by lots of RX failed messages...
--------------------------------------------

This is what the mds reports:
--------------------------------------------
Feb  2 15:58:12 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:59:02 hpc9mds1 last message repeated 4 times
Feb  2 15:59:27 hpc9mds1 last message repeated 3 times
Feb  2 15:59:27 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb  2 15:59:27 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb  2 15:59:27 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:00:17 hpc9mds1 last message repeated 3 times
Feb  2 16:00:42 hpc9mds1 last message repeated 3 times
Feb  2 16:01:07 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:07 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:32 hpc9mds1 last message repeated 3 times
Feb  2 16:01:39 hpc9mds1 kernel: Lustre: 8664:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-MDT0000: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb  2 16:01:39 hpc9mds1 kernel: Lustre: 8664:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2
previous similar messages
Feb  2 16:01:57 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:02:47 hpc9mds1 last message repeated 4 times
Feb  2 16:02:47 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb  2 16:02:47 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb  2 16:02:54 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:03:37 hpc9mds1 last message repeated 3 times
Feb  2 16:04:52 hpc9mds1 last message repeated 6 times
Feb  2 16:06:07 hpc9mds1 last message repeated 8 times
Feb  2 16:07:22 hpc9mds1 last message repeated 7 times
Feb  2 16:07:22 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb  2 16:07:22 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
62 previous similar messages
Feb  2 16:07:29 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:07:47 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116(waiting)
Feb  2 16:07:47 hpc9mds1 kernel: LustreError: 5781:0:(events.c:367:server_bulk_callback()) event
type 4, status -103, desc ffff81010df46d80
Feb  2 16:07:53 hpc9mds1 kernel: LustreError: 25108:0:(handler.c:161:mds_sendpage()) @@@ bulk
failed: timeout 0(4096), evicting b80f9710-f288-9c2a-ee70-16c25f8ca97d at NET_0x50000c0a83ce2_UUID
Feb  2 16:07:53 hpc9mds1 kernel:   req at ffff8100a7e2f000 x1326214018567382/t0
o37->b80f9710-f288-9c2a-ee70-16c25f8ca97d at NET_0x50000c0a83ce2_UUID:0/0 lens 408/360 e 1 to 0 dl
1265123298 ref 1 fl Interpret:/0/0 rc 0/0
Feb  2 16:07:54 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:03 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:08:03 hpc9mds1 kernel: CPU 1:
Feb  2 16:08:03 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:08:03 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:08:03 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:08:03 hpc9mds1 kernel: RIP: 0010:[<ffffffff8006de08>]  [<ffffffff8006de08>]
do_gettimeoffset_hpet+0x7/0x28
Feb  2 16:08:03 hpc9mds1 kernel: RSP: 0018:ffff8102050555b8  EFLAGS: 00000206
Feb  2 16:08:03 hpc9mds1 kernel: RAX: 0000000051f252e8 RBX: ffff8102050555f0 RCX: 0000000000000383
Feb  2 16:08:03 hpc9mds1 kernel: RDX: 0000000000000040 RSI: 0000000000010000 RDI: ffff8102050555f0
Feb  2 16:08:03 hpc9mds1 kernel: RBP: 6666666638313031 R08: 0000000000000000 R09: 0000000000000383
Feb  2 16:08:03 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 6463313434363031
Feb  2 16:08:03 hpc9mds1 kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: ffff81010f780080
Feb  2 16:08:03 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:08:03 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:08:03 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:08:03 hpc9mds1 kernel:
Feb  2 16:08:03 hpc9mds1 kernel: Call Trace:
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8006dc92>] do_gettimeofday+0x40/0x8f
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff88636164>] :libcfs:set_ptldebug_header+0x34/0xa0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8863e300>] :libcfs:libcfs_debug_vmsg2+0x70/0x990
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8878a7ef>] :ptlrpc:ptlrpc_set_destroy+0x37f/0x390
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff887588d5>] :ptlrpc:ldlm_resource_putref+0x395/0x3a0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff887578f9>] :ptlrpc:ldlm_resource_getref+0xc9/0xe0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff88755601>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x21/0xc0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff886d4b3b>] :obdclass:lustre_hash_for_each_empty+0x1cb/0x2b0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:08:03 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:08:04 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:08:04 hpc9mds1 kernel:
Feb  2 16:08:12 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:13 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:08:13 hpc9mds1 kernel: CPU 1:
Feb  2 16:08:13 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:08:13 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:08:13 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:08:13 hpc9mds1 kernel: RIP: 0010:[<ffffffff80064ba8>]  [<ffffffff80064ba8>]
_spin_unlock_irqrestore+0x8/0x9
Feb  2 16:08:13 hpc9mds1 kernel: RSP: 0018:ffff810205055578  EFLAGS: 00000293
Feb  2 16:08:13 hpc9mds1 kernel: RAX: 000000000000037e RBX: ffff81021f8382a0 RCX: 0000000000000001
Feb  2 16:08:13 hpc9mds1 kernel: RDX: 000000000000032e RSI: 0000000000000293 RDI: ffff81010f780080
Feb  2 16:08:13 hpc9mds1 kernel: RBP: ffff810205055610 R08: 00000000ffffffff R09: 0000000000000020
Feb  2 16:08:13 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Feb  2 16:08:13 hpc9mds1 kernel: R13: ffffffff8001a1b8 R14: ffff8100d9680000 R15: 0000000000000000
Feb  2 16:08:13 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:08:13 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:08:13 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:08:13 hpc9mds1 kernel:
Feb  2 16:08:13 hpc9mds1 kernel: Call Trace:
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8863e978>] :libcfs:libcfs_debug_vmsg2+0x6e8/0x990
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88753935>] :ptlrpc:ldlm_lock_put+0x3c5/0x3d0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88771f31>] :ptlrpc:ldlm_export_lock_put+0xd1/0xe0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff886d523c>] :obdclass:lustre_hash_del+0x33c/0x370
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8875311a>] :ptlrpc:ldlm_lock_destroy_internal+0x1ca/0x250
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88755408>] :ptlrpc:ldlm_lock_destroy_nolock+0x68/0xd0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88755548>] :ptlrpc:ldlm_lock_cancel+0xd8/0x170
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8875566e>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x8e/0xc0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff886d4b3b>] :obdclass:lustre_hash_for_each_empty+0x1cb/0x2b0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:08:13 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:08:14 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:08:14 hpc9mds1 kernel:
Feb  2 16:08:19 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:23 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:08:23 hpc9mds1 kernel: CPU 1:
Feb  2 16:08:23 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:08:23 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:08:23 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:08:23 hpc9mds1 kernel: RIP: 0010:[<ffffffff886d4b88>]  [<ffffffff886d4b88>]
:obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Feb  2 16:08:23 hpc9mds1 kernel: RSP: 0018:ffff810205055830  EFLAGS: 00000206
Feb  2 16:08:23 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000004ce1 RCX: 0000000000000001
Feb  2 16:08:23 hpc9mds1 kernel: RDX: 0000000000000c20 RSI: 0000000000000293 RDI: ffffc20005c3be1c
Feb  2 16:08:23 hpc9mds1 kernel: RBP: ffffffff88753935 R08: 00000000ffffffff R09: 0000000000000020
Feb  2 16:08:23 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff887555e0
Feb  2 16:08:23 hpc9mds1 kernel: R13: ffff8101f16b9df8 R14: ffff8101e17ed888 R15: ffff8101f16b9dc0
Feb  2 16:08:23 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:08:23 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:08:23 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:08:23 hpc9mds1 kernel:
Feb  2 16:08:23 hpc9mds1 kernel: Call Trace:
Feb  2 16:08:23 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:08:23 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:08:23 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:08:23 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:08:24 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:08:24 hpc9mds1 kernel:
Feb  2 16:08:37 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:44 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:47 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:08:47 hpc9mds1 kernel: CPU 1:
Feb  2 16:08:47 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:08:47 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:08:47 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:08:47 hpc9mds1 kernel: RIP: 0010:[<ffffffff886d4b88>]  [<ffffffff886d4b88>]
:obdclass:lustre_hash_for_each_empty+0x218/0x2b0
Feb  2 16:08:47 hpc9mds1 kernel: RSP: 0018:ffff810205055830  EFLAGS: 00000206
Feb  2 16:08:47 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000001d7e RCX: 0000000000000001
Feb  2 16:08:47 hpc9mds1 kernel: RDX: 000000000000029a RSI: 0000000000000293 RDI: ffffc20005c0c7ec
Feb  2 16:08:47 hpc9mds1 kernel: RBP: ffffffff88753935 R08: 00000000ffffffff R09: 0000000000000020
Feb  2 16:08:47 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff887555e0
Feb  2 16:08:47 hpc9mds1 kernel: R13: ffff8101f16b9df8 R14: ffff810092b5ae88 R15: ffff8101f16b9dc0
Feb  2 16:08:47 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:08:47 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:08:47 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:08:47 hpc9mds1 kernel:
Feb  2 16:08:47 hpc9mds1 kernel: Call Trace:
Feb  2 16:08:47 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:08:47 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:08:47 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:08:48 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:08:48 hpc9mds1 kernel:
Feb  2 16:09:01 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:09:01 hpc9mds1 kernel: CPU 1:
Feb  2 16:09:01 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:09:01 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:09:01 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:09:01 hpc9mds1 kernel: RIP: 0010:[<ffffffff8006de08>]  [<ffffffff8006de08>]
do_gettimeoffset_hpet+0x7/0x28
Feb  2 16:09:01 hpc9mds1 kernel: RSP: 0018:ffff810205055548  EFLAGS: 00000202
Feb  2 16:09:01 hpc9mds1 kernel: RAX: 00000000a85c775f RBX: ffff810205055580 RCX: 00000000000003c3
Feb  2 16:09:01 hpc9mds1 kernel: RDX: 0000000000000001 RSI: 0000000000010000 RDI: ffff810205055580
Feb  2 16:09:01 hpc9mds1 kernel: RBP: ffffffff8006dc92 R08: 0000000000000000 R09: 00000000000003c3
Feb  2 16:09:01 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010
Feb  2 16:09:01 hpc9mds1 kernel: R13: ffff81010f780080 R14: ffff8100a9034330 R15: 0000009a00000000
Feb  2 16:09:01 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:09:01 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:09:01 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:09:01 hpc9mds1 kernel:
Feb  2 16:09:01 hpc9mds1 kernel: Call Trace:
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8006dc92>] do_gettimeofday+0x40/0x8f
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff88636164>] :libcfs:set_ptldebug_header+0x34/0xa0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8863e300>] :libcfs:libcfs_debug_vmsg2+0x70/0x990
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8875086c>] :ptlrpc:_ldlm_lock_debug+0x57c/0x6e0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8878a7ef>] :ptlrpc:ptlrpc_set_destroy+0x37f/0x390
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff887555e0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff887588d5>] :ptlrpc:ldlm_resource_putref+0x395/0x3a0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff88753703>] :ptlrpc:ldlm_lock_put+0x193/0x3d0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff88771ecf>] :ptlrpc:ldlm_export_lock_put+0x6f/0xe0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8875567e>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x9e/0xc0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff886d4b79>] :obdclass:lustre_hash_for_each_empty+0x209/0x2b0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:09:01 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:09:02 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:09:02 hpc9mds1 kernel:
Feb  2 16:09:02 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:09 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:11 hpc9mds1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_mdt_rdpg_12:25108]
Feb  2 16:09:11 hpc9mds1 kernel: CPU 1:
Feb  2 16:09:11 hpc9mds1 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U)
ldiskfs(U) crc16(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U
) l2cap(U) bluetooth(U) sunrpc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U)
iw_cm(U) ib_addr(U) ib_ipoib(U) ipo
ib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U)
iw_cxgb3(U) cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) i2c_ec(U) button(U) batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U)
ib_core(U) joydev(U) mlx4_core(U) i2c_nforce2(U) sr_mod(U) pcspkr(U) ide_cd(U) serio_raw(U)
i2c_core(U) shpchp(U) e1000(U) k8temp(U)
forcedeth(U) hwmon(U) cdrom(U) sg(U) edac_mc(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U)
dm_mod(U) usb_stora
Feb  2 16:09:11 hpc9mds1 kernel: e(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U)
mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Feb  2 16:09:11 hpc9mds1 kernel: Pid: 25108, comm: ll_mdt_rdpg_12 Tainted: G
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Feb  2 16:09:11 hpc9mds1 kernel: RIP: 0010:[<ffffffff80064aee>]  [<ffffffff80064aee>]
_write_lock+0x7/0xf
Feb  2 16:09:11 hpc9mds1 kernel: RSP: 0018:ffff810205055828  EFLAGS: 00000246
Feb  2 16:09:11 hpc9mds1 kernel: RAX: 000000000000ffff RBX: 0000000000000c49 RCX: 0000000000000001
Feb  2 16:09:11 hpc9mds1 kernel: RDX: 0000000000000ec0 RSI: 0000000000000293 RDI: ffffc20005bfb49c
Feb  2 16:09:11 hpc9mds1 kernel: RBP: ffff8101f16b9df8 R08: 00000000ffffffff R09: 0000000000000020
Feb  2 16:09:11 hpc9mds1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100c6087e88
Feb  2 16:09:11 hpc9mds1 kernel: R13: ffff8101f16b9dc0 R14: ffff8100c6086380 R15: ffff8100c6087e00
Feb  2 16:09:11 hpc9mds1 kernel: FS:  00002b3759c5c220(0000) GS:ffff810103f99440(0000)
knlGS:00000000f7f9d6c0
Feb  2 16:09:11 hpc9mds1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 16:09:11 hpc9mds1 kernel: CR2: 00002b3ef0e64240 CR3: 0000000000201000 CR4: 00000000000006e0
Feb  2 16:09:11 hpc9mds1 kernel:
Feb  2 16:09:11 hpc9mds1 kernel: Call Trace:
Feb  2 16:09:11 hpc9mds1 kernel:  [<ffffffff886d4ba7>] :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
Feb  2 16:09:11 hpc9mds1 kernel:  [<ffffffff88bf05e1>] :mds:mds_disconnect+0x121/0xe40
Feb  2 16:09:11 hpc9mds1 kernel:  [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Feb  2 16:09:11 hpc9mds1 kernel:  [<ffffffff8004aa63>] try_to_del_timer_sync+0x51/0x5a
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff886d7994>] :obdclass:class_fail_export+0x384/0x4c0
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff88bea311>] :mds:mds_readpage+0x1591/0x18c0
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff88bed732>] :mds:mds_handle+0x2172/0x4cd0
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8863b5d8>] :libcfs:libcfs_ip_addr2str+0x38/0x40
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8863b9ee>] :libcfs:libcfs_nid2str+0xbe/0x110
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff887a2357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff887a5e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff800b48f2>] audit_syscall_exit+0x33c/0x357
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff887a4bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Feb  2 16:09:12 hpc9mds1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb  2 16:09:12 hpc9mds1 kernel:
Feb  2 16:09:27 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:34 hpc9mds1 kernel: Lustre: 5007:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
--------------------------------------------

The OSS1 shows this:
--------------------------------------------
Feb  2 15:58:37 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:59:02 hpc9oss1 last message repeated 3 times
Feb  2 15:59:27 hpc9oss1 kernel: Lustre: 7936:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:59:59 hpc9oss1 last message repeated 3 times
Feb  2 16:01:07 hpc9oss1 last message repeated 5 times
Feb  2 16:01:07 hpc9oss1 kernel: Lustre: 7936:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:32 hpc9oss1 kernel: Lustre: 7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:39 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:02:22 hpc9oss1 last message repeated 3 times
Feb  2 16:02:22 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:02:47 hpc9oss1 kernel: Lustre: 7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:02:48 hpc9oss1 kernel: Lustre: 7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:03:12 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:03:38 hpc9oss1 last message repeated 3 times
Feb  2 16:04:02 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:04:04 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:04:21 hpc9oss1 kernel: Lustre: 10054:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0002: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb  2 16:04:21 hpc9oss1 kernel: Lustre: 10054:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped
1 previous similar message
Feb  2 16:04:27 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:05:17 hpc9oss1 last message repeated 4 times
Feb  2 16:06:07 hpc9oss1 last message repeated 5 times
Feb  2 16:06:32 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:06:57 hpc9oss1 last message repeated 2 times
Feb  2 16:06:57 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn
race 192.168.60.226 at o2ib
Feb  2 16:06:57 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Skipped
1 previous similar message
Feb  2 16:07:06 hpc9oss1 kernel: Lustre: 7938:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:07:22 hpc9oss1 kernel: Lustre: 7942:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:07:29 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:19 hpc9oss1 last message repeated 2 times
Feb  2 16:08:19 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb  2 16:08:19 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
46 previous similar messages
Feb  2 16:08:37 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:09 hpc9oss1 last message repeated 3 times
Feb  2 16:09:34 hpc9oss1 last message repeated 2 times
Feb  2 16:09:52 hpc9oss1 kernel: Lustre: 7941:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:59 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:10:42 hpc9oss1 last message repeated 3 times
--------------------------------------------

The OSS2 shows this:
--------------------------------------------
Feb  2 15:58:37 hpc9oss2 kernel: Lustre: 7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:58:37 hpc9oss2 kernel: Lustre: 7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:59:02 hpc9oss2 kernel: Lustre: 7942:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 15:59:52 hpc9oss2 last message repeated 4 times
Feb  2 16:00:42 hpc9oss2 last message repeated 5 times
Feb  2 16:01:07 hpc9oss2 kernel: Lustre: 7948:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:07 hpc9oss2 kernel: Lustre: 7948:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:01:32 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:02:04 hpc9oss2 last message repeated 3 times
Feb  2 16:02:04 hpc9oss2 kernel: Lustre: 10237:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0007: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb  2 16:02:22 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:03:12 hpc9oss2 last message repeated 4 times
Feb  2 16:04:27 hpc9oss2 last message repeated 6 times
Feb  2 16:05:42 hpc9oss2 last message repeated 6 times
Feb  2 16:06:32 hpc9oss2 last message repeated 5 times
Feb  2 16:06:57 hpc9oss2 kernel: Lustre: 7947:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:07:04 hpc9oss2 last message repeated 2 times
Feb  2 16:07:04 hpc9oss2 kernel: Lustre: 10027:0:(ldlm_lib.c:541:target_handle_reconnect())
cfd1-OST0004: b80f9710-f288-9c2a-ee70-16c25f8ca97d reconnecting
Feb  2 16:07:22 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:08:07 hpc9oss2 last message repeated 3 times
Feb  2 16:08:07 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn
stale 192.168.60.226 at o2ib [old ver: 12, new ver: 12]
Feb  2 16:08:07 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Skipped
47 previous similar messages
Feb  2 16:08:12 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:02 hpc9oss2 last message repeated 4 times
Feb  2 16:09:09 hpc9oss2 kernel: Lustre: 7944:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:27 hpc9oss2 kernel: Lustre: 7943:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked())
Closing conn to 192.168.60.226 at o2ib: error -116
Feb  2 16:09:59 hpc9oss2 last message repeated 3 times
--------------------------------------------

What's the cause of those "Rx from 192.168.60.240 at o2ib failed: 5"? Any ideas what's going on here?

Thanks,

Ulrich Sibiller



-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Roland Niemeier, 
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Michel Lepert
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196 





More information about the lustre-discuss mailing list