[Lustre-discuss] Lustre client problems

Lawrence Sorrillo sorrillo at jlab.org
Wed Apr 7 07:23:57 PDT 2010


Has anyone seen this before?


I have a Lustre client that works well soon after reboot (giving
300MB/sec writes over SDR InfiniBand to a Lustre mount), but after
a couple of hours the mount stops working: I get hangs on files coming
from particular OSTs. At the same time, other clients, built a bit
differently, do not hang on the same OSTs.

All clients with this particular build share this same malady.

This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.

(uname -a)
Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010 
x86_64 x86_64 x86_64 GNU/Linux


Here is what it logs to /var/log/messages soon after reboot and during
the initial reads/writes to the Lustre mount areas.

Apr  6 13:37:04 host0 kernel: Lustre: OBD class driver, 
http://www.lustre.org/
Apr  6 13:37:04 host0 kernel: Lustre:     Lustre Version: 1.8.2
Apr  6 13:37:04 host0 kernel: Lustre:     Build Version: 
1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
Apr  6 13:37:05 host0 kernel: Lustre: Listener bound to 
ib0:172.17.3.61:987:mthca0
Apr  6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size: 
0xffffffffffffffff, array size: 1
Apr  6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3.61 at o2ib 
[8/64/0/180]
Apr  6 13:37:05 host0 kernel: Lustre: Added LNI X.X.X.X at tcp [8/256/0/180]
Apr  6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
Apr  6 13:37:06 host0 kernel: Lustre: Lustre Client File System; 
http://www.lustre.org/
Apr  6 13:37:06 host0 kernel: Lustre: MGC172.17.1.83 at o2ib: Reactivating 
import
Apr  6 13:37:06 host0 kernel: Lustre: Client lustre-client has started


....
....
. Everything is fine here....just OS messages that do not pertain to Lustre
....
....
Apr  6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
Apr  6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in 
36986 seconds.
Apr  7 08:38:36 host0 : error getting update info: (104, 'Connection 
reset by peer')
Apr  7 09:09:30 host0 kernel: LustreError: 
5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 
9 seconds
Apr  7 09:09:30 host0 kernel: LustreError: 
5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 
172.17.1.108 at o2ib (84)
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
12345-172.17.1.108 at o2ib: -113
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
req at ffff810509419000 x1332294902650884/t0 
o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 
dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902650884 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
172.17.1.108 at o2ib 0s ago has failed due to network error (17s prior to 
deadline).
Apr  7 09:09:45 host0 kernel:   req at ffff810509419000 
x1332294902650884/t0 o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 
lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
lustre-OST0018-osc-ffff810335e15c00: Connection to service 
lustre-OST0018 via nid 172.17.1.108 at o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
12345-172.17.1.108 at o2ib: -113
Apr  7 09:09:45 host0 kernel: LustreError: 
5313:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
req at ffff8104345b2c00 x1332294902650898/t0 
o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 
1270645791 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902650898 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
172.17.1.108 at o2ib 0s ago has failed due to network error (6s prior to 
deadline).
Apr  7 09:09:45 host0 kernel:   req at ffff8104345b2c00 
x1332294902650898/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
368/584 e 0 to 1 dl 1270645791 ref 1 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
Apr  7 09:09:45 host0 kernel: Lustre: 
lustre-OST0019-osc-ffff810335e15c00: Connection to service 
lustre-OST0019 via nid 172.17.1.108 at o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
Apr  7 09:09:52 host0 kernel: Lustre: 
5314:0:(import.c:524:import_select_connection()) 
lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing 
latency to 2s
Apr  7 09:09:59 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902654188 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
172.17.1.108 at o2ib 7s ago has timed out (7s prior to deadline).
Apr  7 09:09:59 host0 kernel:   req at ffff8104ff9c6c00 
x1332294902654188/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
368/584 e 0 to 1 dl 1270645799 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:59 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 4 previous 
similar messages
Apr  7 09:10:00 host0 kernel: Lustre: 
5314:0:(import.c:524:import_select_connection()) 
lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing 
latency to 3s
Apr  7 09:10:00 host0 kernel: Lustre: 
5314:0:(import.c:524:import_select_connection()) Skipped 2 previous 
similar messages
Apr  7 09:10:08 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902658081 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
172.17.1.108 at o2ib 8s ago has timed out (8s prior to deadline).
Apr  7 09:10:08 host0 kernel:   req at ffff810378e91400 
x1332294902658081/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
368/584 e 0 to 1 dl 1270645808 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:10:08 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 2 previous 
similar messages
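
For anyone triaging a log like the one above, a small Python sketch can
tally which NIDs and OSTs the messages mention and put a name on the
negative status codes. This is only an illustration: the function names
(summarize, describe) and the regular expressions are assumptions of this
sketch, not part of any Lustre tool, and the errno names are the Linux
x86_64 values. The list archive rewrites "@" as " at ", so the NID pattern
accepts both forms.

```python
import re
from collections import Counter

# NIDs look like "172.17.1.108@o2ib"; the list archive shows "@" as " at ".
NID_RE = re.compile(r"(\d+\.\d+\.\d+\.\d+)(?:@| at )(o2ib\d*|tcp\d*)")
# Target names such as "lustre-OST0018" (also inside "..._UUID" or "...-osc-...").
OST_RE = re.compile(r"([A-Za-z0-9]+-OST[0-9a-fA-F]{4})")
# Negative codes as in "status -113" or "...o2ib: -113".
STATUS_RE = re.compile(r"(?:status|o2ib\d*:)\s+(-\d+)")

# Hand-picked subset of Linux errno values (hard-coded so the sketch
# gives the same answer on non-Linux machines).
LINUX_ERRNO = {
    104: "ECONNRESET",
    107: "ENOTCONN",
    110: "ETIMEDOUT",
    113: "EHOSTUNREACH",
}


def summarize(log_text):
    """Count NID, OST and status-code mentions in a (possibly wrapped) log."""
    # Mail clients wrap long syslog lines, so collapse all whitespace first.
    text = " ".join(log_text.split())
    nids = Counter(f"{ip}@{net}" for ip, net in NID_RE.findall(text))
    osts = Counter(OST_RE.findall(text))
    codes = Counter(int(code) for code in STATUS_RE.findall(text))
    return nids, osts, codes


def describe(code):
    """Map a negative kernel status such as -113 to its errno name."""
    return LINUX_ERRNO.get(-code, "unknown")
```

Calling summarize(open("/var/log/messages").read()) returns three
counters (mentions per NID, per OST, per status code). For the excerpt
above, describe(-113) gives EHOSTUNREACH ("No route to host"), and every
LustreError names the same server NID, 172.17.1.108 on o2ib, which serves
both OST0018 and OST0019.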

~Lawrence