[Lustre-discuss] Client evictions and RDMA failures
syed haider
syed.haider at gmail.com
Tue Mar 31 07:29:59 PDT 2009
Dear Lustre group,
I'm hoping you can help with this problem. My configuration is as follows:
4 OSSs | 1 MDS/MGS | n client nodes
RPMs installed on CentOS 5.2 systems:
lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-ib-1.3.1-2.6.18_92.1.10.el5_lustre.1.6.6smp
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6
I'm able to start Lustre on all OSSs and the MDS/MGS and mount it on the
clients successfully. But eventually the Lustre mount hangs (df hangs) on
the clients. Initially I thought it might be an InfiniBand fabric problem,
but I see no errors on the switch and all cables are attached securely.
The hangs are random: some nodes stay up for days and some hang after a
couple of hours, but eventually all nodes hang.
> When a node
> hangs, it is unable to lctl ping one of the OSSs. For example, node-0-6
> is hanging. From this node I can lctl ping
> oss-0-0, oss-0-2 and oss-0-3, but lctl ping to oss-0-1 just hangs. If I do
> the same from oss-0-1 to node-0-6, I get the following error message:
>
> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
> failed to ping 192.255.255.220@o2ib: Input/output error
>
> Interestingly enough, the OSS can ping any other node:
>
> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.222@o2ib
> 12345-0@lo
> 12345-192.255.255.222@o2ib
>
> And the node can ping any other system:
>
> [root@tiger-node-0-6 ~]# lctl ping 192.255.255.253@o2ib
> 12345-0@lo
> 12345-192.255.255.253@o2ib
>
> Only the communication between the two is broken.
>
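The pairwise checks above can be scripted so that a hanging ping does not block the shell. This is a minimal sketch, assuming `lctl` is on the PATH and that a non-zero exit status or a timeout means the peer is unreachable. The helper name `check_nids`, the 10-second limit, and the injectable `run` parameter are illustrative choices, not part of Lustre.

```python
import subprocess


def check_nids(nids, timeout=10, run=subprocess.run):
    """Return {nid: True/False} for LNET reachability from this host.

    Assumption: `lctl ping <nid>` exits 0 on success. A ping that hangs
    is cut off by the subprocess timeout and counted as unreachable.
    `run` is injectable so the logic can be exercised without Lustre.
    """
    results = {}
    for nid in nids:
        try:
            proc = run(["lctl", "ping", nid], capture_output=True,
                       text=True, timeout=timeout)
            results[nid] = (proc.returncode == 0)
        except subprocess.TimeoutExpired:
            # Matches the symptom in the post: the ping never returns.
            results[nid] = False
    return results
```

Running `check_nids(["192.255.255.252@o2ib", "192.255.255.253@o2ib"])` on node-0-6 and the equivalent call on each OSS would give a reachability table per host, which makes a single broken client/server pair easy to spot.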
> The only messages from oss-0-1 related to node-0-6 are these:
>
> [root@tiger-oss-0-1 ~]# cat /var/log/messages |grep 192.255.255.220
> Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST0008: haven't
> heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at
> 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am
> evicting it.
> Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST000d: haven't
> heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at
> 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am
> evicting it.
>
> Messages from node-0-6 related to oss-0-1:
>
> [root@tiger-node-0-6 ~]# cat /var/log/messages |grep 192.255.255.252
> Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError:
> 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with
> 192.255.255.252@o2ib
> Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError:
> 4482:0:(events.c:66:request_out_callback()) @@@ type 4, status -103
> req@ffff81006ea43a00 x2646/t0
> o400->lustre-OST000f_UUID@192.255.255.252@o2ib:28/4 lens 128/256 e 0 to
> 100 dl 1238020602 ref 2 fl Rpc:N/0/0 rc 0/0
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: Request x2644 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 59s ago
> has timed out (limit 100s).
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection to service
> lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000e-osc-ffff81007f555000: Connection restored to service
> lustre-OST000e using nid 192.255.255.252@o2ib.
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection restored to service
> lustre-OST000d using nid 192.255.255.252@o2ib.
> Mar 26 04:19:35 tiger-node-0-6 kernel: LustreError:
> 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with
> 192.255.255.252@o2ib
> Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre: Request x50261 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection to service
> lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre: Request x50290 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre:
> lustre-OST0008-osc-ffff81007f555000: Connection to service
> lustre-OST0008 via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 26 04:21:10 tiger-node-0-6 kernel: Lustre: Request x50324 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:21:35 tiger-node-0-6 kernel: Lustre: Request x50358 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:22:00 tiger-node-0-6 kernel: Lustre: Request x50416 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:22:24 tiger-node-0-6 kernel: Lustre: Request x50429 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:24:04 tiger-node-0-6 kernel: Lustre: Request x50538 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:24:30 tiger-node-0-6 kernel: Lustre: Request x50567 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:26:09 tiger-node-0-6 kernel: Lustre: Request x50676 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:28:14 tiger-node-0-6 kernel: Lustre: Request x50814 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:30:44 tiger-node-0-6 kernel: Lustre: Request x50981 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:36:33 tiger-node-0-6 kernel: Lustre: Request x51366 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:45:18 tiger-node-0-6 kernel: Lustre: Request x51947 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:55:43 tiger-node-0-6 kernel: Lustre: Request x52637 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:06:08 tiger-node-0-6 kernel: Lustre: Request x53327 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:16:33 tiger-node-0-6 kernel: Lustre: Request x54017 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:26:58 tiger-node-0-6 kernel: Lustre: Request x54707 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:37:23 tiger-node-0-6 kernel: Lustre: Request x55397 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:47:47 tiger-node-0-6 kernel: Lustre: Request x56087 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:58:12 tiger-node-0-6 kernel: Lustre: Request x56777 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:08:37 tiger-node-0-6 kernel: Lustre: Request x57467 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:19:02 tiger-node-0-6 kernel: Lustre: Request x58157 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:29:27 tiger-node-0-6 kernel: Lustre: Request x58847 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:39:52 tiger-node-0-6 kernel: Lustre: Request x59537 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:50:17 tiger-node-0-6 kernel: Lustre: Request x60227 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:00:41 tiger-node-0-6 kernel: Lustre: Request x60917 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:11:06 tiger-node-0-6 kernel: Lustre: Request x61607 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:21:31 tiger-node-0-6 kernel: Lustre: Request x62271 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:31:56 tiger-node-0-6 kernel: Lustre: Request x62961 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:42:21 tiger-node-0-6 kernel: Lustre: Request x63651 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:52:46 tiger-node-0-6 kernel: Lustre: Request x64341 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:03:11 tiger-node-0-6 kernel: Lustre: Request x65031 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:13:36 tiger-node-0-6 kernel: Lustre: Request x65721 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:24:00 tiger-node-0-6 kernel: Lustre: Request x66411 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:34:25 tiger-node-0-6 kernel: Lustre: Request x67101 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:44:50 tiger-node-0-6 kernel: Lustre: Request x67791 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:55:15 tiger-node-0-6 kernel: Lustre: Request x68481 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:05:40 tiger-node-0-6 kernel: Lustre: Request x69171 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:16:05 tiger-node-0-6 kernel: Lustre: Request x69861 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:26:30 tiger-node-0-6 kernel: Lustre: Request x70551 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:36:54 tiger-node-0-6 kernel: Lustre: Request x71249 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:38:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:39:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:40:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:41:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:42:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:47:19 tiger-node-0-6 kernel: Lustre: Request x71939 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:48:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
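A long run of timeout messages like the one above is easier to read once it is grouped by target. The following is a rough sketch, assuming each message sits on a single line as it does in /var/log/messages (not wrapped as in this mail) and that NIDs are written with `@`. The function name `summarize_timeouts` and the regular expression are illustrative and only match the "Request ... has timed out" lines.

```python
import re
from collections import defaultdict

# Matches e.g.:
# Mar 26 04:20:44 host kernel: Lustre: Request x50290 sent from
#   lustre-OST0008-osc-ffff... to NID 192.255.255.252@o2ib 100s ago has timed out
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+Lustre:\s+"
    r"Request x\d+ sent from (?P<target>\S+?)-osc-\S+ to NID (?P<nid>\S+) "
    r"\d+s ago has timed out"
)


def summarize_timeouts(lines):
    """Group request timeouts by (target, nid).

    Returns {(target, nid): {"count", "first", "last"}} where first/last
    are the syslog timestamps of the earliest and latest matching line,
    in file order.
    """
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if not m:
            continue  # skip evictions, RDMA errors, ping messages, etc.
        entry = summary[(m["target"], m["nid"])]
        entry["count"] += 1
        if entry["first"] is None:
            entry["first"] = m["stamp"]
        entry["last"] = m["stamp"]
    return dict(summary)
```

Fed the client log above, this should report timeouts only for lustre-OST0008 and lustre-OST000d on 192.255.255.252@o2ib, the same two OSTs that logged the evictions on oss-0-1.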