[Lustre-discuss] Client evictions and RDMA failures

syed haider syed.haider at gmail.com
Tue Mar 31 07:29:59 PDT 2009


Dear lustre group,

I'm hoping you can help with this problem. My configuration is as follows:

4 OSS's | 1 MDS/MGS | n # nodes

RPM's installed on CentOS 5.2 systems:

lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-ib-1.3.1-2.6.18_92.1.10.el5_lustre.1.6.6smp
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6

I'm able to start Lustre on all OSSs and the MDS/MGS and mount it on the
clients successfully. But eventually the Lustre mount hangs (df hangs) on
the clients. Initially I thought it might be a fabric problem with IB, but I
see no errors on the switch and all cables are attached securely. The
hangs are very random: some nodes stay up for days and
some hang after a couple of hours, but inevitably all nodes hang.


> When a node hangs, it is unable to do an lctl ping to an OSS. For example,
> node-0-6 is hanging. From this node I can do an lctl ping to
> oss-0-0, oss-0-2 and oss-0-3. An lctl ping to oss-0-1 just hangs. And if I do
> the same from oss-0-1 to node-0-6 I get the following error message:
>
> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
> failed to ping 192.255.255.220@o2ib: Input/output error
>
> Interestingly enough, the OSS can ping any other node:
>
> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.222@o2ib
> 12345-0@lo
> 12345-192.255.255.222@o2ib
>
> And the node can ping any other system:
>
> [root@tiger-node-0-6 ~]# lctl ping 192.255.255.253@o2ib
> 12345-0@lo
> 12345-192.255.255.253@o2ib
>
> Only the communication between the two is broken.
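
The pairwise check described above could be scripted so it can be repeated from each node. This is only a minimal sketch, assuming that `lctl ping <nid>` exits non-zero on failure and that a ping which does not return within a few seconds should count as unreachable. The function names and the 10 second timeout are hypothetical choices, not anything from the Lustre tools.

```python
import subprocess


def lctl_ping(nid, timeout_s=10):
    """Return True if `lctl ping <nid>` exits 0 within timeout_s seconds.

    Assumption: lctl returns a non-zero exit status when the ping fails.
    """
    try:
        result = subprocess.run(
            ["lctl", "ping", nid],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A ping that hangs (as seen toward oss-0-1) or a missing lctl
        # binary are both treated as "unreachable".
        return False
    return result.returncode == 0


def find_unreachable(nids, ping=lctl_ping):
    """Return the NIDs from `nids` that do not answer, in input order."""
    return [nid for nid in nids if not ping(nid)]


if __name__ == "__main__":
    # NIDs taken from the session above; extend the list as needed.
    peers = ["192.255.255.252@o2ib", "192.255.255.253@o2ib"]
    for nid in find_unreachable(peers):
        print("no answer from", nid)
```

Run from node-0-6 in the state described here, it would be expected to list only the NID of oss-0-1. Passing the `ping` function as a parameter keeps the reporting logic testable on a machine without Lustre installed.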

> The only messages from oss-0-1 related to node-0-6 are these:
>
> [root@tiger-oss-0-1 ~]# cat /var/log/messages |grep 192.255.255.220
> Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST0008: haven't
> heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at
> 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am
> evicting it.
> Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST000d: haven't
> heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at
> 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am
> evicting it.

> Messages from node-0-6 related to oss-0-1:
>
> [root@tiger-node-0-6 ~]# cat /var/log/messages |grep 192.255.255.252
> Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError:
> 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with
> 192.255.255.252@o2ib
> Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError:
> 4482:0:(events.c:66:request_out_callback()) @@@ type 4, status -103
> req@ffff81006ea43a00 x2646/t0
> o400->lustre-OST000f_UUID@192.255.255.252@o2ib:28/4 lens 128/256 e 0 to
> 100 dl 1238020602 ref 2 fl Rpc:N/0/0 rc 0/0
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: Request x2644 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 59s ago
> has timed out (limit 100s).
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection to service
> lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000e-osc-ffff81007f555000: Connection restored to service
> lustre-OST000e using nid 192.255.255.252@o2ib.
> Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection restored to service
> lustre-OST000d using nid 192.255.255.252@o2ib.
> Mar 26 04:19:35 tiger-node-0-6 kernel: LustreError:
> 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with
> 192.255.255.252@o2ib
> Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre: Request x50261 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre:
> lustre-OST000d-osc-ffff81007f555000: Connection to service
> lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre: Request x50290 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre:
> lustre-OST0008-osc-ffff81007f555000: Connection to service
> lustre-OST0008 via nid 192.255.255.252@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> Mar 26 04:21:10 tiger-node-0-6 kernel: Lustre: Request x50324 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:21:35 tiger-node-0-6 kernel: Lustre: Request x50358 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:22:00 tiger-node-0-6 kernel: Lustre: Request x50416 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:22:24 tiger-node-0-6 kernel: Lustre: Request x50429 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:24:04 tiger-node-0-6 kernel: Lustre: Request x50538 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:24:30 tiger-node-0-6 kernel: Lustre: Request x50567 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago
> has timed out (limit 100s).
> Mar 26 04:26:09 tiger-node-0-6 kernel: Lustre: Request x50676 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:28:14 tiger-node-0-6 kernel: Lustre: Request x50814 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:30:44 tiger-node-0-6 kernel: Lustre: Request x50981 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:36:33 tiger-node-0-6 kernel: Lustre: Request x51366 sent from
> lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:45:18 tiger-node-0-6 kernel: Lustre: Request x51947 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 04:55:43 tiger-node-0-6 kernel: Lustre: Request x52637 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:06:08 tiger-node-0-6 kernel: Lustre: Request x53327 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:16:33 tiger-node-0-6 kernel: Lustre: Request x54017 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:26:58 tiger-node-0-6 kernel: Lustre: Request x54707 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:37:23 tiger-node-0-6 kernel: Lustre: Request x55397 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:47:47 tiger-node-0-6 kernel: Lustre: Request x56087 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 05:58:12 tiger-node-0-6 kernel: Lustre: Request x56777 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:08:37 tiger-node-0-6 kernel: Lustre: Request x57467 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:19:02 tiger-node-0-6 kernel: Lustre: Request x58157 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:29:27 tiger-node-0-6 kernel: Lustre: Request x58847 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:39:52 tiger-node-0-6 kernel: Lustre: Request x59537 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 06:50:17 tiger-node-0-6 kernel: Lustre: Request x60227 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:00:41 tiger-node-0-6 kernel: Lustre: Request x60917 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:11:06 tiger-node-0-6 kernel: Lustre: Request x61607 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:21:31 tiger-node-0-6 kernel: Lustre: Request x62271 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:31:56 tiger-node-0-6 kernel: Lustre: Request x62961 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:42:21 tiger-node-0-6 kernel: Lustre: Request x63651 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 07:52:46 tiger-node-0-6 kernel: Lustre: Request x64341 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:03:11 tiger-node-0-6 kernel: Lustre: Request x65031 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:13:36 tiger-node-0-6 kernel: Lustre: Request x65721 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:24:00 tiger-node-0-6 kernel: Lustre: Request x66411 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:34:25 tiger-node-0-6 kernel: Lustre: Request x67101 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:44:50 tiger-node-0-6 kernel: Lustre: Request x67791 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 08:55:15 tiger-node-0-6 kernel: Lustre: Request x68481 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:05:40 tiger-node-0-6 kernel: Lustre: Request x69171 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:16:05 tiger-node-0-6 kernel: Lustre: Request x69861 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:26:30 tiger-node-0-6 kernel: Lustre: Request x70551 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:36:54 tiger-node-0-6 kernel: Lustre: Request x71249 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:38:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:39:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:40:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:41:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:42:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
> Mar 26 09:47:19 tiger-node-0-6 kernel: Lustre: Request x71939 sent from
> lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago
> has timed out (limit 100s).
> Mar 26 09:48:43 tiger-node-0-6 kernel: Lustre:
> 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late
> network completion
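
Since the client log above is long and repetitive, a small script could tally the "has timed out" messages per OST target to show which targets on oss-0-1 keep failing. This is a rough sketch under stated assumptions: the regular expression is written against the message wording seen in this log only, it accepts both the `@` and the archive's ` at ` form of the NID, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the "Request x... has timed out" message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"Request x(?P<xid>\d+) sent from (?P<osc>\S+) to NID "
    r"(?P<addr>[\d.]+)(?:@| at )(?P<net>\w+) (?P<age>\d+)s ago "
    r"has timed out \(limit (?P<limit>\d+)s\)"
)


def normalize(text):
    """Drop mail quote markers and join wrapped log lines with single spaces."""
    lines = (line.lstrip("> ").strip() for line in text.splitlines())
    return " ".join(line for line in lines if line)


def count_timeouts(text):
    """Count timed-out requests per (OST target, server NID) pair."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(normalize(text)):
        # "lustre-OST0008-osc-ffff..." -> "lustre-OST0008"
        target = match.group("osc").split("-osc-")[0]
        nid = match.group("addr") + "@" + match.group("net")
        counts[(target, nid)] += 1
    return counts


if __name__ == "__main__":
    import sys

    for (target, nid), n in sorted(count_timeouts(sys.stdin.read()).items()):
        print(target, nid, n)
```

It could be fed the same grep output as above, for example `grep 192.255.255.252 /var/log/messages | python count_timeouts.py` (the script file name is only an example).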


