[Lustre-discuss] Lustre client problems
Lawrence Sorrillo
sorrillo at jlab.org
Wed Apr 7 07:23:57 PDT 2010
Has anyone seen this before?
I have a Lustre client that works well soon after reboot (giving
300MB/sec writes over SDR InfiniBand to a Lustre mount), but after
a couple of hours the mount stops working: I get hangs on files coming
from particular OSTs. At the same time, other clients, built a bit
differently, do not hang on the same OSTs.
All clients with this particular build share the same malady.
This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.
(uname -a)
Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010
x86_64 x86_64 x86_64 GNU/Linux
Here is what /var/log/messages shows soon after reboot and during the
initial reads/writes to the Lustre mount areas.
Apr 6 13:37:04 host0 kernel: Lustre: OBD class driver,
http://www.lustre.org/
Apr 6 13:37:04 host0 kernel: Lustre: Lustre Version: 1.8.2
Apr 6 13:37:04 host0 kernel: Lustre: Build Version:
1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
Apr 6 13:37:05 host0 kernel: Lustre: Listener bound to
ib0:172.17.3.61:987:mthca0
Apr 6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size:
0xffffffffffffffff, array size: 1
Apr 6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3.61 at o2ib
[8/64/0/180]
Apr 6 13:37:05 host0 kernel: Lustre: Added LNI X.X.X.X at tcp [8/256/0/180]
Apr 6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
Apr 6 13:37:06 host0 kernel: Lustre: Lustre Client File System;
http://www.lustre.org/
Apr 6 13:37:06 host0 kernel: Lustre: MGC172.17.1.83 at o2ib: Reactivating
import
Apr 6 13:37:06 host0 kernel: Lustre: Client lustre-client has started
....
....
Everything is fine here....just OS messages that do not pertain to Lustre
....
....
Apr 6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
Apr 6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in
36986 seconds.
Apr 7 08:38:36 host0 : error getting update info: (104, 'Connection
reset by peer')
Apr 7 09:09:30 host0 kernel: LustreError:
5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs,
9 seconds
Apr 7 09:09:30 host0 kernel: LustreError:
5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with
172.17.1.108 at o2ib (84)
Apr 7 09:09:45 host0 kernel: LustreError:
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to
12345-172.17.1.108 at o2ib: -113
Apr 7 09:09:45 host0 kernel: LustreError:
5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113
req at ffff810509419000 x1332294902650884/t0
o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1
dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre:
5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request
x1332294902650884 sent from lustre-OST0018-osc-ffff810335e15c00 to NID
172.17.1.108 at o2ib 0s ago has failed due to network error (17s prior to
deadline).
Apr 7 09:09:45 host0 kernel: req at ffff810509419000
x1332294902650884/t0 o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4
lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre:
lustre-OST0018-osc-ffff810335e15c00: Connection to service
lustre-OST0018 via nid 172.17.1.108 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Apr 7 09:09:45 host0 kernel: LustreError:
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to
12345-172.17.1.108 at o2ib: -113
Apr 7 09:09:45 host0 kernel: LustreError:
5313:0:(events.c:66:request_out_callback()) @@@ type 4, status -113
req at ffff8104345b2c00 x1332294902650898/t0
o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl
1270645791 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre:
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request
x1332294902650898 sent from lustre-OST0018-osc-ffff810335e15c00 to NID
172.17.1.108 at o2ib 0s ago has failed due to network error (6s prior to
deadline).
Apr 7 09:09:45 host0 kernel: req at ffff8104345b2c00
x1332294902650898/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens
368/584 e 0 to 1 dl 1270645791 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: LustreError:
5312:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
Apr 7 09:09:45 host0 kernel: Lustre:
lustre-OST0019-osc-ffff810335e15c00: Connection to service
lustre-OST0019 via nid 172.17.1.108 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Apr 7 09:09:52 host0 kernel: Lustre:
5314:0:(import.c:524:import_select_connection())
lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing
latency to 2s
Apr 7 09:09:59 host0 kernel: Lustre:
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request
x1332294902654188 sent from lustre-OST0018-osc-ffff810335e15c00 to NID
172.17.1.108 at o2ib 7s ago has timed out (7s prior to deadline).
Apr 7 09:09:59 host0 kernel: req at ffff8104ff9c6c00
x1332294902654188/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens
368/584 e 0 to 1 dl 1270645799 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:59 host0 kernel: Lustre:
5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Apr 7 09:10:00 host0 kernel: Lustre:
5314:0:(import.c:524:import_select_connection())
lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing
latency to 3s
Apr 7 09:10:00 host0 kernel: Lustre:
5314:0:(import.c:524:import_select_connection()) Skipped 2 previous
similar messages
Apr 7 09:10:08 host0 kernel: Lustre:
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request
x1332294902658081 sent from lustre-OST0018-osc-ffff810335e15c00 to NID
172.17.1.108 at o2ib 8s ago has timed out (8s prior to deadline).
Apr 7 09:10:08 host0 kernel: req at ffff810378e91400
x1332294902658081/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens
368/584 e 0 to 1 dl 1270645808 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:10:08 host0 kernel: Lustre:
5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 2 previous
similar messages
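
To compare this across clients, it may help to tally which targets lose
their connection and through which server NID. Below is a minimal Python
sketch, not part of any Lustre tooling: the function name lost_connections
and the inline sample are illustrative assumptions. It accepts the NID
separator both as "@" (as in a real syslog) and as " at " (as it appears in
this archived copy), and it undoes line wrapping before matching.

```python
import re
from collections import Counter

# Matches the client-side "connection lost" message. The NID separator is
# "@" in a real syslog and " at " in the archived copy of this post.
LOST_RE = re.compile(
    r"Connection to service (\S+) via nid (\S+?)(?:@| at )(\S+) was lost"
)


def lost_connections(log_text):
    """Count 'connection lost' events per (target, NID) pair.

    Hypothetical helper for illustration only.
    """
    flat = " ".join(log_text.split())  # undo line wrapping
    counts = Counter()
    for target, addr, net in LOST_RE.findall(flat):
        counts[(target, f"{addr}@{net}")] += 1
    return counts


# Two messages copied from the log excerpt above.
sample = """\
Apr 7 09:09:45 host0 kernel: Lustre:
lustre-OST0018-osc-ffff810335e15c00: Connection to service
lustre-OST0018 via nid 172.17.1.108 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Apr 7 09:09:45 host0 kernel: Lustre:
lustre-OST0019-osc-ffff810335e15c00: Connection to service
lustre-OST0019 via nid 172.17.1.108 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
"""

if __name__ == "__main__":
    for (target, nid), n in sorted(lost_connections(sample).items()):
        print(f"{target} via {nid}: {n}")
```

In the excerpt above, both lost targets (lustre-OST0018 and lustre-OST0019)
are reached through the same NID, 172.17.1.108 at o2ib. Running the same tally
on the full log of a working client and of a failing one would show whether
the failures are confined to that one server.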
~Lawrence
~