[Lustre-discuss] I/O error on clients

Gabriele Paciucci paciucci at gmail.com
Tue Jul 6 02:23:30 PDT 2010


Hi Peter,
which 10GbE card do you have? I solved a similar problem with a NetXen 
card (an HP blade mezzanine card) by using the proprietary nx_nic driver 
instead of the "open source" one. In any case, the underlying problem is 
that your users are saturating the network between the clients and the OSTs!
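
To get a clearer picture of how bad it is, a rough script like the one 
below can summarise the OSS-side symptoms per client NID. It is only a 
sketch: the /var/log/messages path and the regular expressions are 
assumptions based on the message formats in the log excerpt you posted.

#!/usr/bin/env python
# Rough sketch: summarise Lustre evictions / timeouts per client NID from
# an OSS syslog. The /var/log/messages default and the regexes below are
# assumptions modelled on the message formats in the excerpt that follows.
import re
import sys

logfile = "/var/log/messages"
if len(sys.argv) > 1:
    logfile = sys.argv[1]

# One pattern per symptom visible in the posted log.
patterns = [
    ("evicted",   re.compile(r"evicting client at (\S+)")),
    ("refused",   re.compile(r"refuse reconnection from [^@\s]+@(\S+)")),
    ("timed_out", re.compile(r"to NID (\S+) \d+s ago has timed out")),
]

counts = {}  # client NID -> {symptom: count}
f = open(logfile)
for line in f:
    for kind, pat in patterns:
        m = pat.search(line)
        if m:
            per_nid = counts.setdefault(m.group(1), {})
            per_nid[kind] = per_nid.get(kind, 0) + 1
f.close()

for nid in sorted(counts):
    summary = ", ".join(["%s=%d" % (k, v) for k, v in sorted(counts[nid].items())])
    print("%-25s %s" % (nid, summary))

If the same client NIDs keep turning up while the heavy jobs run, that 
points at the network path between those clients and the OSSs rather 
than at the storage itself.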

On 07/06/2010 08:19 AM, Peter Kitchener wrote:
> Hi all,
>
> I have been troubleshooting a strange problem with our Lustre setup. 
> Under high load our developers complain that various processes they 
> run error out with an I/O error.
>
> Our setup is small: 1 MDS, 2 OSSes (10 OSTs, 5 per OSS) and 13 clients 
> (152 cores). The storage is all local, 60 TB usable (30 TB per OSS), in 
> a software RAID6 setup. All of the machines are connected via 10 Gigabit 
> Ethernet. The servers run CentOS 5.4 with kernel 
> 2.6.18-164.11.1.el5_lustre.1.8.2; the clients run Rocks 5.3 (CentOS 5.4) 
> with an unpatched vanilla CentOS kernel and Lustre 1.8.3.
>
> So far I've not been able to pinpoint where I should begin to look. I 
> have been trawling through log files that, quite frankly, don't make 
> much sense to me.
>
> Here is the messages output from the OSS:
>
> ##############################
>
> Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from 
> client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125 at tcp) in 
> 227 seconds. I think it's dead, and I am evicting it.
> Jul  6 15:08:26 helium kernel: Lustre: 
> 6539:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
> Jul  6 15:08:26 helium kernel: Lustre: 
> 6539:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0006: refuse 
> reconnection from 593eb160-edce-8ceb-3f93-6f743cacd1a1 at 10.0.0.54@tcp 
> to 0xffff81026241b800; still busy with 5 active RPCs
> Jul  6 15:08:26 helium kernel: LustreError: 
> 6539:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-16)  req at ffff810282fe1000 x1340041377953748/t0 
> o8->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 368/264 e 0 to 0 dl 1278393006 ref 1 fl Interpret:/0/0 rc -16/0
> Jul  6 15:08:26 helium kernel: LustreError: 
> 6660:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET  
> req at ffff810357883c00 x1340041377934618/t0 
> o4->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:08:26 helium kernel: LustreError: 
> 6660:0:(ost_handler.c:1061:ost_brw_write()) Skipped 1 previous similar 
> message
> Jul  6 15:08:26 helium kernel: LustreError: 
> 6704:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET  
> req at ffff81082af44050 x1340041377934964/t0 
> o4->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:08:27 helium kernel: Lustre: 
> 7062:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
> Jul  6 15:08:27 helium kernel: Lustre: 
> 7062:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse 
> reconnection from 593eb160-edce-8ceb-3f93-6f743cacd1a1 at 10.0.0.54@tcp 
> to 0xffff81025a535e00; still busy with 3 active RPCs
> Jul  6 15:08:27 helium kernel: LustreError: 
> 7062:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-16)  req at ffff810698233850 x1340041377955630/t0 
> o8->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 368/264 e 0 to 0 dl 1278393007 ref 1 fl Interpret:/0/0 rc -16/0
> Jul  6 15:08:27 helium kernel: Lustre: 
> 6692:0:(ost_handler.c:1219:ost_brw_write()) AC3-OST0006: ignoring bulk 
> IO comm error with 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID id 
> 12345-10.0.0.54 at tcp - client will retry
> Jul  6 15:08:27 helium kernel: Lustre: 
> 6692:0:(ost_handler.c:1219:ost_brw_write()) Skipped 6 previous similar 
> messages
> Jul  6 15:08:27 helium kernel: LustreError: 
> 6720:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  
> req at ffff8103a72aec00 x1340041377933615/t0 
> o3->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/400 e 0 to 0 dl 1278392946 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:08:27 helium kernel: LustreError: 
> 6720:0:(ost_handler.c:829:ost_brw_read()) Skipped 1 previous similar 
> message
> Jul  6 15:08:29 helium kernel: Lustre: 
> 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk 
> IO comm error with 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID id 
> 12345-10.0.0.54 at tcp - client will retry
> Jul  6 15:08:29 helium kernel: Lustre: 
> 6720:0:(ost_handler.c:886:ost_brw_read()) Skipped 1 previous similar 
> message
> Jul  6 15:08:37 helium kernel: Lustre: 
> 7058:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
> Jul  6 15:08:39 helium kernel: Lustre: 
> 6522:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
> Jul  6 15:10:09 helium kernel: Lustre: 
> 7064:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 
> e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
> Jul  6 15:10:09 helium kernel: Lustre: 
> 7064:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse 
> reconnection from e6ce7565-0b86-5925-82fc-015770c5143c at 10.0.0.53@tcp 
> to 0xffff810267bbc400; still busy with 1 active RPCs
> Jul  6 15:10:09 helium kernel: LustreError: 
> 7064:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-16)  req at ffff81069588b800 x1340036504045855/t0 
> o8->e6ce7565-0b86-5925-82fc-015770c5143c at NET_0x200000a000035_UUID:0/0 
> lens 368/264 e 0 to 0 dl 1278393109 ref 1 fl Interpret:/0/0 rc -16/0
> Jul  6 15:10:09 helium kernel: LustreError: 
> 6649:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  
> req at ffff81053a96f400 x1340036504013451/t0 
> o3->e6ce7565-0b86-5925-82fc-015770c5143c at NET_0x200000a000035_UUID:0/0 
> lens 448/400 e 0 to 0 dl 1278393053 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:10:09 helium kernel: LustreError: 
> 6649:0:(ost_handler.c:829:ost_brw_read()) Skipped 2 previous similar 
> messages
> Jul  6 15:10:13 helium kernel: Lustre: 
> 6649:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk 
> IO comm error with 
> e6ce7565-0b86-5925-82fc-015770c5143c at NET_0x200000a000035_UUID id 
> 12345-10.0.0.53 at tcp - client will retry
> Jul  6 15:10:13 helium kernel: Lustre: 
> 6649:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar 
> messages
> Jul  6 15:10:17 helium kernel: Lustre: 
> 7016:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 
> e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
> Jul  6 15:10:17 helium kernel: LustreError: 
> 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback 
> timer expired after 100s: evicting client at 10.0.0.54 at tcp  ns: 
> filter-AC3-OST0006_UUID lock: ffff8103a644e200/0xded0540147d4c8c7 lrc: 
> 3/0,0 mode: PR/PR res: 8432287/0 rrc: 2 type: EXT 
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 
> remote: 0x6c318a80ee850f9f expref: 878 pid: 7043 timeout 4896983243
> Jul  6 15:10:17 helium kernel: LustreError: 
> 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) Skipped 1 previous 
> similar message
> Jul  6 15:10:17 helium kernel: LustreError: 
> 6715:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req at ffff81027fcc5400 x1340041378188643/t0 o4-><?>@<?>:0/0 lens 
> 448/0 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc -107/0
> Jul  6 15:10:19 helium kernel: LustreError: 
> 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback 
> timer expired after 100s: evicting client at 10.0.0.54 at tcp  ns: 
> filter-AC3-OST0008_UUID lock: ffff810489298400/0xded0540147d4c8b9 lrc: 
> 3/0,0 mode: PR/PR res: 8127131/0 rrc: 2 type: EXT 
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 
> remote: 0x6c318a80ee850fbb expref: 34 pid: 6592 timeout 4896985868
> Jul  6 15:10:19 helium kernel: LustreError: 
> 6730:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  
> req at ffff810355aeec00 x1340041378188706/t0 
> o3->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:10:19 helium kernel: LustreError: 
> 6665:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  
> req at ffff81069e442400 x1340041378188736/t0 
> o3->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:10:19 helium kernel: LustreError: 
> 6665:0:(ost_handler.c:825:ost_brw_read()) Skipped 5 previous similar 
> messages
> Jul  6 15:10:20 helium kernel: Lustre: 
> 6730:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk 
> IO comm error with 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID id 
> 12345-10.0.0.54 at tcp - client will retry
> Jul  6 15:10:20 helium kernel: LustreError: 
> 6714:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req at ffff8106965e0800 x1340041378196268/t0 o3-><?>@<?>:0/0 lens 
> 448/0 e 0 to 0 dl 1278393072 ref 1 fl Interpret:/0/0 rc -107/0
> Jul  6 15:11:10 helium kernel: Lustre: 
> 7119:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: 
> d9a4606b-2d46-7e5b-c67d-c05610d8af95 reconnecting
> Jul  6 15:18:00 helium kernel: Lustre: 
> 6979:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: 
> e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
> Jul  6 15:30:40 helium kernel: Lustre: 
> 7119:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850745218 sent from AC3-OST0006 to NID 10.0.0.54 at tcp 7s ago 
> has timed out (7s prior to deadline).
> Jul  6 15:30:40 helium kernel:   req at ffff8104fd018000 
> x1339860850745218/t0 o104->@NET_0x200000a000036_UUID:15/16 lens 
> 296/384 e 0 to 1 dl 1278394240 ref 2 fl Rpc:N/0/0 rc 0/0
> Jul  6 15:30:40 helium kernel: LustreError: 138-a: AC3-OST0006: A 
> client on nid 10.0.0.54 at tcp was evicted due to a lock blocking 
> callback to 10.0.0.54 at tcp timed out: rc -107
> Jul  6 15:30:40 helium kernel: LustreError: 
> 7119:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed 
> export ffff8102688f7400 ns: filter-AC3-OST0006_UUID lock: 
> ffff81012bdcb200/0xded0540147d6bb48 lrc: 3/0,0 mode: --/PW res: 
> 8432457/0 rrc: 2 type: EXT [0->376831] (req 0->376831) flags: 0x0 
> remote: 0x6c318a80f18346f1 expref: 357 pid: 7119 timeout 0
> Jul  6 15:30:40 helium kernel: LustreError: 
> 7119:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req at ffff8103870e8400 x1340041383514587/t0 
> o101->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 lens 
> 296/352 e 0 to 0 dl 1278394284 ref 1 fl Interpret:/0/0 rc -107/0
> Jul  6 15:30:41 helium kernel: LustreError: 
> 6679:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  
> req at ffff81082a335c50 x1340041383502349/t0 
> o3->593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID:0/0 
> lens 448/400 e 0 to 0 dl 1278394307 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:30:41 helium kernel: Lustre: 
> 6679:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0006: ignoring bulk 
> IO comm error with 
> 593eb160-edce-8ceb-3f93-6f743cacd1a1 at NET_0x200000a000036_UUID id 
> 12345-10.0.0.54 at tcp - client will retry
> Jul  6 15:30:41 helium kernel: Lustre: 
> 6679:0:(ost_handler.c:886:ost_brw_read()) Skipped 6 previous similar 
> messages
> Jul  6 15:30:41 helium kernel: LustreError: 
> 8166:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 
> 12345-10.0.0.54 at tcp with bad export cookie 16055425036052657553
> Jul  6 15:32:08 helium kernel: Lustre: 
> 7134:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850745435 sent from AC3-OST0005 to NID 10.0.0.56 at tcp 7s ago 
> has timed out (7s prior to deadline).
> Jul  6 15:32:08 helium kernel:   req at ffff810695874000 
> x1339860850745435/t0 o104->@:15/16 lens 296/384 e 0 to 1 dl 1278394328 
> ref 2 fl Rpc:N/0/0 rc 0/0
> Jul  6 15:32:08 helium kernel: LustreError: 138-a: AC3-OST0005: A 
> client on nid 10.0.0.56 at tcp was evicted due to a lock blocking 
> callback to 10.0.0.56 at tcp timed out: rc -107
> Jul  6 15:32:08 helium kernel: LustreError: 
> 7134:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed 
> export ffff8102552f5400 ns: filter-AC3-OST0005_UUID lock: 
> ffff810485fade00/0xded0540147d6ead5 lrc: 3/0,0 mode: --/PW res: 
> 8108669/0 rrc: 2 type: EXT [0->18446744073709551615] (req 
> 0->18446744073709551615) flags: 0x0 remote: 0xd1f8ae5995b5c83b expref: 
> 620 pid: 7134 timeout 0
> Jul  6 15:32:08 helium kernel: LustreError: 
> 7134:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req at ffff81009ddd6800 x1340491441560517/t0 
> o101->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 296/352 e 0 to 0 
> dl 1278394372 ref 1 fl Interpret:/0/0 rc -107/0
> Jul  6 15:32:08 helium kernel: LustreError: 
> 6610:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  
> req at ffff810357887c00 x1340491441559786/t0 
> o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 
> 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:32:08 helium kernel: LustreError: 
> 6610:0:(ost_handler.c:825:ost_brw_read()) Skipped 3 previous similar 
> messages
> Jul  6 15:32:08 helium kernel: Lustre: 
> 6610:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk 
> IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 
> 12345-10.0.0.56 at tcp - client will retry
> Jul  6 15:32:08 helium kernel: Lustre: 
> 6610:0:(ost_handler.c:886:ost_brw_read()) Skipped 3 previous similar 
> messages
> Jul  6 15:32:08 helium kernel: LustreError: 
> 6720:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  
> req at ffff81040de33000 x1340491441560035/t0 
> o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 
> 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
> Jul  6 15:32:16 helium kernel: LustreError: 
> 12001:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 
> 12345-10.0.0.56 at tcp with bad export cookie 16055425036052176730
> Jul  6 15:32:19 helium kernel: Lustre: 
> 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk 
> IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 
> 12345-10.0.0.56 at tcp - client will retry
> Jul  6 15:32:21 helium kernel: Lustre: 
> 6712:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk 
> IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 
> 12345-10.0.0.56 at tcp - client will retry
> Jul  6 15:32:21 helium kernel: Lustre: 
> 6712:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar 
> messages
> Jul  6 15:36:56 helium kernel: Lustre: 
> 6970:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746067 sent from AC3-OST0006 to NID 10.0.0.57 at tcp 11s ago 
> has timed out (11s prior to deadline).
> Jul  6 15:36:56 helium kernel:   req at ffff810086c83000 
> x1339860850746067/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394616 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:37:01 helium diskmond: 168:Polling all 48 slots for drive fault
> Jul  6 15:37:10 helium diskmond: sata4/5 device(/dev/sdal) is running bad
> Jul  6 15:37:10 helium diskmond: please back up and replace the disk 
> soon.
> Jul  6 15:38:05 helium kernel: Lustre: 
> 6537:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746176 sent from AC3-OST0005 to NID 10.0.0.57 at tcp 8s ago 
> has timed out (8s prior to deadline).
> Jul  6 15:38:05 helium kernel:   req at ffff810341fdc800 
> x1339860850746176/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394685 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:38:18 helium kernel: Lustre: 
> 6524:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746195 sent from AC3-OST0008 to NID 10.0.0.57 at tcp 7s ago 
> has timed out (7s prior to deadline).
> Jul  6 15:38:18 helium kernel:   req at ffff810282961400 
> x1339860850746195/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394698 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:39:07 helium kernel: Lustre: 
> 6793:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746269 sent from AC3-OST0006 to NID 10.0.0.57 at tcp 11s ago 
> has timed out (11s prior to deadline).
> Jul  6 15:39:07 helium kernel:   req at ffff81027ce5b400 
> x1339860850746269/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394747 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:40:19 helium kernel: Lustre: 
> 6851:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746437 sent from AC3-OST0005 to NID 10.0.0.57 at tcp 8s ago 
> has timed out (8s prior to deadline).
> Jul  6 15:40:19 helium kernel:   req at ffff8104135bb800 
> x1339860850746437/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394819 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:42:10 helium kernel: Lustre: 
> 6935:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746615 sent from AC3-OST0005 to NID 10.0.0.53 at tcp 7s ago 
> has timed out (7s prior to deadline).
> Jul  6 15:42:10 helium kernel:   req at ffff81025b227c00 
> x1339860850746615/t0 o106->@NET_0x200000a000035_UUID:15/16 lens 
> 296/424 e 0 to 1 dl 1278394930 ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:46:01 helium kernel: Lustre: 
> 6957:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1339860850746990 sent from AC3-OST0007 to NID 172.16.16.124 at tcp 7s 
> ago has timed out (7s prior to deadline).
> Jul  6 15:46:01 helium kernel:   req at ffff8103fabd9800 
> x1339860850746990/t0 o106->@NET_0x20000ac10107c_UUID:15/16 lens 
> 296/424 e 0 to 1 dl 1278395161 ref 2 fl Rpc:/0/0 rc 0/0
>
>
> ########################
>
> Here is the output from the client at the same time:
>
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: 11-0: an error 
> occurred while communicating with 172.16.16.2 at tcp. The ost_write 
> operation failed with -107
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: Skipped 1 previous 
> similar message
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: 167-0: This client 
> was evicted by AC3-OST0006; in progress operations using this service 
> will fail.
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: Skipped 4 previous 
> similar messages
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: 
> 3095:0:(namei.c:1176:ll_objects_destroy()) obd destroy objid 
> 0x18542c4 at 0x0 error -5
> Jul  6 15:10:17 compute-0-3 kernel: LustreError: 
> 6419:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, 
> returning -EIO
> Jul  6 15:10:18 compute-0-3 kernel: LustreError: 
> 6779:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace 
> AC3-OST0006-osc-ffff81043e754c00 resource refcount nonzero (1) after 
> lock cleanup; forcing cleanup.
> Jul  6 15:10:18 compute-0-3 kernel: LustreError: 
> 6779:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: 
> ffff810117997500 (8432287/0/0/0) (rc: 1)
> Jul  6 15:10:18 compute-0-3 kernel: LustreError: 
> 6687:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
> Jul  6 15:10:18 compute-0-3 kernel: LustreError: 
> 6782:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, 
> returning -EIO
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 11-0: an error 
> occurred while communicating with 172.16.16.2 at tcp. The ost_read 
> operation failed with -107
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 167-0: This client 
> was evicted by AC3-OST0008; in progress operations using this service 
> will fail.
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 20660:0:(rw.c:122:ll_brw()) error from obd_brw: rc = -4
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace 
> AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after 
> lock cleanup; forcing cleanup.
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: 
> ffff81041604b9c0 (5521743/0/0/0) (rc: 1)
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 6686:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
>  req at ffff81042061a800 x1340041378196315/t0
> o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 512/624 e 0 to 1 dl 0 
> ref 2 fl Rpc:/0/0 rc 0/0
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 3571:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 
> 78 previous similar messages
> Jul  6 15:10:20 compute-0-3 kernel: LustreError: 
> 6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, 
> returning -EIO
>
> Kind Regards,
>
> Peter Kitchener
> Systems Administrator
> Capital Markets CRC Limited (CMCRC)
> Telephone: +61 2 8088 4223
> Fax: +61 2 8088 4201
> www.cmcrc.com
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>    


-- 
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci
