[Lustre-discuss] Rx failures

Ulrich Sibiller U.Sibiller at science-computing.de
Wed Feb 10 04:52:23 PST 2010


Ulrich Sibiller wrote:
> Hi,
> 
> we are experiencing some weird behaviour on one of our Lustre clients.
> 
> First some information about our environment:
> - Lustre 1.8.1.1 on CentOS 5.2 (kernel 2.6.18-92.1.13.el5)
> - self-compiled patchless clients with quotas enabled (quotas not used at the moment)
> - Infiniband interconnect, OFED 1.3.3
> - OSS: 2x Sun Fire X4540 with 48TB, with official Sun kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1,
>   4 OSTs on each of them, OFED 1.3.3
> - MDS: 2x Sun Fire X4100 M2 + 1x StorEdge 3320 with heartbeat failover, official Sun kernel
>   2.6.18-128.7.1.el5_lustre.1.8.1.1, OFED 1.4.2
> - Lustre mounted on /hpcscr
> - OSS1 is 192.168.60.238@o2ib, hostname is hpc9oss1
> - OSS2 is 192.168.60.237@o2ib, hostname is hpc9oss2
> - MDS1 is 192.168.60.240@o2ib, hostname is hpc9mds1 (active)
> - MDS2 is 192.168.60.239@o2ib, hostname is hpc9mds2 (standby; was active for a short time while
>   mds1 was upgraded from 1.6.7.1 to 1.8.1.1)
> - problematic client is 192.168.60.226@o2ib, hostname hpc9master02
> - no problems on the Infiniband
> 
> Problem:
> Users report a slow Lustre filesystem on this particular machine (hpc9master02). Running "find
> /hpcscr -ls" stalls after some time. Most of the time it continues after a few seconds, but
> sometimes it takes several minutes, and sometimes I get errors (one "I/O error", then several
> "Cannot send after transport endpoint shutdown") and the find terminates. The IB error counters
> do not change during this test.
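
To put numbers on the stalls described above, a traversal similar to "find -ls" can be timed per directory. The following is only a minimal sketch: the function name find_stalls and the one-second threshold are assumptions made for illustration and are not part of the original report.

```python
import os
import sys
import time


def find_stalls(root, threshold=1.0):
    """Walk root, stat every entry, and report slow or failing directories.

    Returns (stalls, errors): stalls is a list of (directory, seconds) for
    directories that took at least `threshold` seconds to list and stat,
    errors is a list of (directory, errno) for directories that failed.
    """
    stalls, errors = [], []
    stack = [root]
    while stack:
        current = stack.pop()
        start = time.monotonic()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # Like "find -ls", force a stat of every entry.
                    entry.stat(follow_symlinks=False)
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError as exc:
            errors.append((current, exc.errno))
        elapsed = time.monotonic() - start
        if elapsed >= threshold:
            stalls.append((current, elapsed))
    return stalls, errors


if __name__ == "__main__":
    # Pass the mount point (e.g. /hpcscr) as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    stalls, errors = find_stalls(root)
    for path, seconds in stalls:
        print(f"slow: {path} took {seconds:.1f}s")
    for path, code in errors:
        print(f"error: {path} failed with errno {code}")
```

Running the same script on a client that uses Ethernet would give a baseline to compare against.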

I discovered that this problem only occurs when this client uses the InfiniBand connection (o2ib).
The IB port error counters do not increase anywhere. Running over Ethernet works perfectly.

I am now running Lustre 1.8.2 on the client and on all Lustre servers. I replaced the client's IB
cable and tried a different switch port and the client's other HCA port, but it still does not work.
With 1.8.2 I constantly see error -113 on the client and on all Lustre servers:

Feb 10 13:30:38 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
cfd1-OST0000-osc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:30:38 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
Skipped 2 previous similar messages
Feb 10 13:30:45 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
cfd1-OST0002-osc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:30:45 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
Skipped 1 previous similar message
Feb 10 13:31:49 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ 
Request x1327292548671897 sent from cfd1-OST0007-osc-ffff812025f37c00 to NID 192.168.60.237@o2ib 44s 
ago has timed out (44s prior to deadline).
Feb 10 13:31:49 hpc9master02 kernel:   req@ffff811fbb893000 x1327292548671897/t0 
o101->cfd1-OST0007_UUID@192.168.60.237@o2ib:28/4 lens 296/544 e 0 to 1 dl 1265805109 ref 1 fl 
Rpc:/0/0 rc 0/0
Feb 10 13:31:49 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) 
Skipped 12 previous similar messages
Feb 10 13:31:49 hpc9master02 kernel: Lustre: cfd1-OST0007-osc-ffff812025f37c00: Connection to 
service cfd1-OST0007 via nid 192.168.60.237@o2ib was lost; in progress operations using this service 
will wait for recovery to complete.
Feb 10 13:31:49 hpc9master02 kernel: Lustre: Skipped 4 previous similar messages
Feb 10 13:31:49 hpc9master02 kernel: Lustre: cfd1-OST0007-osc-ffff812025f37c00: Connection restored 
to service cfd1-OST0007 using nid 192.168.60.237@o2ib.
Feb 10 13:31:49 hpc9master02 kernel: Lustre: Skipped 4 previous similar messages
Feb 10 13:32:27 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff812025f37c00: Connection to 
service cfd1-MDT0000 via nid 192.168.60.239@o2ib was lost; in progress operations using this service 
will wait for recovery to complete.
Feb 10 13:32:27 hpc9master02 kernel: Lustre: Skipped 1 previous similar message
Feb 10 13:32:27 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff812025f37c00: Connection restored 
to service cfd1-MDT0000 using nid 192.168.60.239@o2ib.
Feb 10 13:32:27 hpc9master02 kernel: Lustre: Skipped 1 previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Error sending 
PUT to 12345-192.168.60.239@o2ib: -113
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Skipped 1 
previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(events.c:66:request_out_callback()) @@@ 
type 4, status -113  req@ffff811ff264c000 x1327292548700295/t0 o400->MGS@192.168.60.239@o2ib:26/25 
lens 192/384 e 0 to 1 dl 1265805221 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(events.c:66:request_out_callback()) 
Skipped 1 previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 166-1: MGC192.168.60.240@o2ib: Connection to 
service MGS via nid 192.168.60.239@o2ib was lost; in progress operations using this service will fail.
Feb 10 13:33:31 hpc9master02 kernel: Lustre: cfd1-OST0002-osc-ffff812025f37c00: Connection restored 
to service cfd1-OST0002 using nid 192.168.60.238@o2ib.
Feb 10 13:33:32 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
cfd1-MDT0000-mdc-ffff812025f37c00: tried all connections, increasing latency to 2s
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(lib-move.c:2436:LNetPut()) Error sending 
PUT to 12345-192.168.60.239@o2ib: -113
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(lib-move.c:2436:LNetPut()) Skipped 3 
previous similar messages
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(events.c:66:request_out_callback()) @@@ 
type 4, status -113  req@ffff811b7da91800 x1327292548700311/t0 o250->MGS@192.168.60.239@o2ib:26/25 
lens 368/584 e 0 to 1 dl 1265805218 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(events.c:66:request_out_callback()) 
Skipped 3 previous similar messages
Feb 10 13:33:32 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
Skipped 1 previous similar message
Feb 10 13:33:39 hpc9master02 kernel: Lustre: MGC192.168.60.240@o2ib: Reactivating import
Feb 10 13:33:40 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) 
cfd1-MDT0000-mdc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:34:40 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ 
Request x1327292548717600 sent from cfd1-OST0002-osc-ffff812025f37c00 to NID 192.168.60.238@o2ib 46s 
ago has timed out (46s prior to deadline).
Feb 10 13:34:40 hpc9master02 kernel:   req@ffff811b7d81a000 x1327292548717600/t0 
o101->cfd1-OST0002_UUID@192.168.60.238@o2ib:28/4 lens 296/544 e 0 to 1 dl 1265805280 ref 1 fl 
Rpc:/0/0 rc 0/0
Feb 10 13:34:40 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) 
Skipped 13 previous similar messages
Feb 10 13:34:40 hpc9master02 kernel: Lustre: cfd1-OST0002-osc-ffff812025f37c00: Connection to 
service cfd1-OST0002 via nid 192.168.60.238@o2ib was lost; in progress operations using this service 
will wait for recovery to complete.
Feb 10 13:34:40 hpc9master02 kernel: Lustre: Skipped 2 previous similar messages
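
A log excerpt like the one above is easier to read once the failures are grouped by peer. The sketch below counts send errors and request timeouts per NID and labels each NID with the hostname from the environment list in the quoted message. It is an illustration, not a Lustre tool: the names HOSTS and summarize are made up here, and the two regular expressions only cover the two message shapes that appear in this excerpt.

```python
import re
from collections import Counter

# NID address -> hostname, copied from the environment list in this post.
HOSTS = {
    "192.168.60.238": "hpc9oss1",
    "192.168.60.237": "hpc9oss2",
    "192.168.60.240": "hpc9mds1 (active)",
    "192.168.60.239": "hpc9mds2 (standby)",
}

# The list archive may show "@" as " at ", so accept both spellings.
NID = r"(\d+\.\d+\.\d+\.\d+)(?:@| at )o2ib"
SEND_ERR = re.compile(r"Error sending PUT to \d+-" + NID + r": (-\d+)")
TIMEOUT = re.compile(r"to NID " + NID + r" \d+s ago has timed out")


def summarize(log_text):
    """Count send errors and request timeouts per peer address."""
    # Mail wrapping splits messages across lines; collapse whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for addr, rc in SEND_ERR.findall(flat):
        counts[(addr, "send error " + rc)] += 1
    for addr in TIMEOUT.findall(flat):
        counts[(addr, "request timeout")] += 1
    return counts


if __name__ == "__main__":
    sample = """
    LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Error sending
    PUT to 12345-192.168.60.239 at o2ib: -113
    Request x1327292548671897 sent from cfd1-OST0007-osc-ffff812025f37c00
    to NID 192.168.60.237 at o2ib 44s
    ago has timed out (44s prior to deadline).
    """
    for (addr, kind), n in sorted(summarize(sample).items()):
        print(f"{HOSTS.get(addr, 'unknown')} {addr}: {kind} x{n}")
```

Note that the "Skipped N previous similar messages" lines are not counted, so the totals are lower bounds.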

According to

root@hpc9master02 network-scripts # find /usr/include -name "errno*" | xargs grep -E "\<113\>"
/usr/include/asm-generic/errno.h:#define        EHOSTUNREACH    113     /* No route to host */

this error means "no route to host". How can this happen?
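
The same lookup can be done without grepping the headers. This is a small sketch using Python's standard errno and os modules; the helper name describe is made up for illustration, and the number-to-name mapping is platform specific (113 is EHOSTUNREACH on Linux, as the header above shows).

```python
import errno
import os


def describe(rc):
    """Return (symbolic name, message) for a kernel-style return code."""
    # Kernel return codes are negative errno values, so drop the sign.
    code = abs(rc)
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, text = describe(-113)
    print(name, "-", text)
```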

Uli



-- 
__________________________________creating IT solutions
Dipl.-Inf. Ulrich Sibiller     science + computing ag
System Administration          Hagellocher Weg 73
fax      +49 7071 9457 411     72070 Tuebingen, Germany
teamline +49 7071 9457 674     www.science-computing.de




