[Lustre-discuss] Lustre client problems

Lawrence Sorrillo sorrillo at jlab.org
Wed Apr 7 07:59:33 PDT 2010


Also, the logs from the OST serving the files on which we see the 
hangs show the following errors:

Apr  7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr  7 02:51:45 loss09 kernel: Lustre: lustre-OST001a: haven't heard 
from client dd7aee74-0bb9-7b4a-4c7f-d0e78fff45ef (at 172.17.0.160 at o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.
Apr  7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr  7 02:53:18 loss09 kernel: LustreError: 
13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error 
(-107)  req at ffff81018021c000 x1326517357508998/t0 o400-><?>@<?>:0/0 lens 
192/0 e 0 to 0 dl 1270623204 ref 1 fl Interpret:H/0/0 rc -107/0
Apr  7 02:53:18 loss09 kernel: LustreError: 
13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 5 previous 
similar messages
Apr  7 09:12:42 loss09 kernel: Lustre: lustre-OST001a: haven't heard 
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61 at o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.
Apr  7 09:13:07 host09 kernel: Lustre: lustre-OST0018: haven't heard 
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61 at o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.
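
The negative numbers in these records (-107 above, and -113 in the client log quoted further down) look like negated Linux errno values. A minimal sketch for decoding them follows, assuming that interpretation holds. The table is hard-coded with the Linux numbering for just the two codes that appear in this thread, and `LINUX_ERRNO` and `decode_codes` are hypothetical names used for illustration, not part of Lustre.

```python
import re

# Linux errno numbering (other platforms number these differently).
# Only the two codes that appear in this thread are listed.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# A minus sign followed by digits, but not when it sits inside a UUID,
# a kernel version, or a NID prefix such as "12345-172.17.1.108".
NEG_CODE = re.compile(r"(?<![\w-])-(\d+)\b")


def decode_codes(record):
    """Return (code, name, description) for each known code in one log record."""
    seen = []
    for match in NEG_CODE.finditer(record):
        code = int(match.group(1))
        if code in LINUX_ERRNO and code not in seen:
            seen.append(code)
    return [(code,) + LINUX_ERRNO[code] for code in seen]


if __name__ == "__main__":
    record = (
        "LustreError: 13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) "
        "@@@ processing error (-107)  req at ffff81018021c000 rc -107/0"
    )
    for code, name, text in decode_codes(record):
        print(f"-{code}: {name} ({text})")
```

Read this way, the OST is replying to a client it no longer considers connected (-107), while the client cannot reach the server NID at all (-113).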


172.17.3.61@o2ib is the IB interface of the client experiencing the 
hang condition.
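
To match evictions against a client NID without reading the log by eye, something like the following could be used. It is only a sketch: it assumes the records are wrapped across lines as they are in this mail, it accepts both `@o2ib` and the archive's ` at o2ib` spelling, and the helper names (`unwrap`, `evictions`) are made up for illustration. The sample text is copied from the OST log above.

```python
import re

# Two eviction records copied from the OST log above, still line-wrapped.
SAMPLE = """\
Apr  7 09:12:42 loss09 kernel: Lustre: lustre-OST001a: haven't heard
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61 at o2ib)
in 227 seconds. I think it's dead, and I am evicting it.
Apr  7 09:13:07 host09 kernel: Lustre: lustre-OST0018: haven't heard
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61 at o2ib)
in 227 seconds. I think it's dead, and I am evicting it.
"""

# A syslog record starts with "Mon DD HH:MM:SS "; other lines are continuations.
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

EVICTION = re.compile(
    r"^(?P<when>\w{3} [ \d]\d [\d:]{8}) (?P<host>\S+) kernel: Lustre: "
    r"(?P<target>\S+): haven't heard from client (?P<uuid>[0-9a-f-]+) "
    r"\(at (?P<addr>[\d.]+)(?:@| at )(?P<net>\w+)\) in (?P<idle>\d+) seconds"
)


def unwrap(text):
    """Join wrapped continuation lines back onto the record they belong to."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def evictions(text):
    """Return one dict per eviction record found in the log text."""
    found = []
    for record in unwrap(text):
        match = EVICTION.match(record)
        if match:
            found.append({
                "when": match.group("when"),
                "host": match.group("host"),
                "target": match.group("target"),
                "uuid": match.group("uuid"),
                "nid": f"{match.group('addr')}@{match.group('net')}",
                "idle_seconds": int(match.group("idle")),
            })
    return found


if __name__ == "__main__":
    for event in evictions(SAMPLE):
        if event["nid"] == "172.17.3.61@o2ib":
            print(event["when"], event["target"], event["idle_seconds"])
```

Filtering on the NID rather than the client UUID keeps the comparison tied to the interface named above.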

~Lawrence

Lawrence Sorrillo wrote:
> Has anyone seen this before?
>
>
> I have a Lustre client that works well soon after reboot (giving 
> 300MB/sec writes over SDR InfiniBand to a Lustre mount), but after 
> a couple of hours the mount stops working: I get hangs on files 
> coming from particular OSTs. At the same time, other clients, built 
> a bit differently, do not hang on the same OST. 
>
> All clients with this particular build share this same malady.
>
> This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.
>
> (uname -a)
> Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010 
> x86_64 x86_64 x86_64 GNU/Linux
>
>
> Here is what /var/log/messages shows soon after reboot and during the 
> initial reads/writes to the Lustre mount areas.
>
> Apr  6 13:37:04 host0 kernel: Lustre: OBD class driver, 
> http://www.lustre.org/
> Apr  6 13:37:04 host0 kernel: Lustre:     Lustre Version: 1.8.2
> Apr  6 13:37:04 host0 kernel: Lustre:     Build Version: 
> 1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
> Apr  6 13:37:05 host0 kernel: Lustre: Listener bound to 
> ib0:172.17.3.61:987:mthca0
> Apr  6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size: 
> 0xffffffffffffffff, array size: 1
> Apr  6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3.61 at o2ib 
> [8/64/0/180]
> Apr  6 13:37:05 host0 kernel: Lustre: Added LNI X.X.X.X at tcp [8/256/0/180]
> Apr  6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
> Apr  6 13:37:06 host0 kernel: Lustre: Lustre Client File System; 
> http://www.lustre.org/
> Apr  6 13:37:06 host0 kernel: Lustre: MGC172.17.1.83 at o2ib: Reactivating 
> import
> Apr  6 13:37:06 host0 kernel: Lustre: Client lustre-client has started
>
>
> ....
> ....
> Everything is fine here, just OS messages that do not pertain to Lustre
> ....
> ....
> Apr  6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
> Apr  6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in 
> 36986 seconds.
> Apr  7 08:38:36 host0 : error getting update info: (104, 'Connection 
> reset by peer')
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 
> 9 seconds
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 
> 172.17.1.108 at o2ib (84)
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
> 12345-172.17.1.108 at o2ib: -113
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
> req at ffff810509419000 x1332294902650884/t0 
> o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 
> dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: Lustre: 
> 5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1332294902650884 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
> 172.17.1.108 at o2ib 0s ago has failed due to network error (17s prior to 
> deadline).
> Apr  7 09:09:45 host0 kernel:   req at ffff810509419000 
> x1332294902650884/t0 o400->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 
> lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: Lustre: 
> lustre-OST0018-osc-ffff810335e15c00: Connection to service 
> lustre-OST0018 via nid 172.17.1.108 at o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
> 12345-172.17.1.108 at o2ib: -113
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5313:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
> req at ffff8104345b2c00 x1332294902650898/t0 
> o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 
> 1270645791 ref 2 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: Lustre: 
> 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1332294902650898 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
> 172.17.1.108 at o2ib 0s ago has failed due to network error (6s prior to 
> deadline).
> Apr  7 09:09:45 host0 kernel:   req at ffff8104345b2c00 
> x1332294902650898/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
> 368/584 e 0 to 1 dl 1270645791 ref 1 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
> Apr  7 09:09:45 host0 kernel: Lustre: 
> lustre-OST0019-osc-ffff810335e15c00: Connection to service 
> lustre-OST0019 via nid 172.17.1.108 at o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
> Apr  7 09:09:52 host0 kernel: Lustre: 
> 5314:0:(import.c:524:import_select_connection()) 
> lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing 
> latency to 2s
> Apr  7 09:09:59 host0 kernel: Lustre: 
> 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1332294902654188 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
> 172.17.1.108 at o2ib 7s ago has timed out (7s prior to deadline).
> Apr  7 09:09:59 host0 kernel:   req at ffff8104ff9c6c00 
> x1332294902654188/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
> 368/584 e 0 to 1 dl 1270645799 ref 2 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:59 host0 kernel: Lustre: 
> 5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 4 previous 
> similar messages
> Apr  7 09:10:00 host0 kernel: Lustre: 
> 5314:0:(import.c:524:import_select_connection()) 
> lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing 
> latency to 3s
> Apr  7 09:10:00 host0 kernel: Lustre: 
> 5314:0:(import.c:524:import_select_connection()) Skipped 2 previous 
> similar messages
> Apr  7 09:10:08 host0 kernel: Lustre: 
> 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1332294902658081 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 
> 172.17.1.108 at o2ib 8s ago has timed out (8s prior to deadline).
> Apr  7 09:10:08 host0 kernel:   req at ffff810378e91400 
> x1332294902658081/t0 o8->lustre-OST0018_UUID at 172.17.1.108@o2ib:28/4 lens 
> 368/584 e 0 to 1 dl 1270645808 ref 2 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:10:08 host0 kernel: Lustre: 
> 5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 2 previous 
> similar messages
>
> ~Lawrence
> ~
>
>