[Lustre-discuss] odd pauses when writing

John White jwhite at lbl.gov
Thu Nov 8 09:24:59 PST 2012


Good Morning Folks,
	We're (seemingly suddenly) seeing odd I/O pauses of about 20-30 seconds during client writes into one of our file systems (specifically, an rsync from an NFS mount to a Lustre file system).  On the client, we see blocks like the following when a pause occurs:
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff880080ec4000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1819:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff880034c72000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -113, desc ffff8803c6658000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff8805a283e000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff8805b1b0e000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff8805ca086000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff88054b762000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff8805ae49c000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff88045cb74000

On the OSS, we can see (note: 10.0.2.8 is the client in question):

Nov  8 09:21:18 n0002.lustre LustreError: 8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from 12345-10.0.2.8 at tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192, last alive is 1 secs ago 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from 12345-10.0.2.8 at tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192, last alive is 1 secs ago 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103be200000 
Nov  8 09:21:18 n0002.lustre LustreError: 8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103be200000 
Nov  8 09:21:18 n0002.lustre LustreError: 9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at ffff8104178a6c00 x1412852387822649/t0 o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl 1352395330 ref 1 fl Interpret:/0/0 rc 0/0 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at ffff8104178a6c00 x1412852387822649/t0 o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl 1352395330 ref 1 fl Interpret:/0/0 rc 0/0 
Nov  8 09:21:18 n0002.lustre Lustre: 9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8 at tcp - client will retry 
Nov  8 09:21:18 n0002.lustre kernel: Lustre: 9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8 at tcp - client will retry 
Nov  8 09:21:24 n0002.lustre Lustre: 8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: 81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting 
Nov  8 09:21:24 n0002.lustre Lustre: 8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar messages 
Nov  8 09:21:24 n0002.lustre kernel: Lustre: 8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: 81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting 
Nov  8 09:21:24 n0002.lustre kernel: Lustre: 8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar messages 

Any ideas as to a cause?  Is this network loss?
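
For reference, the "status" values in the bulk callback messages above look like negated Linux errno codes: -5 would be EIO and -113 would be EHOSTUNREACH. A minimal sketch for tallying them from a saved log follows. The function name, regex, and the small errno table are illustrative assumptions, not part of any Lustre tooling; the errno names are hardcoded from the Linux values so the result does not depend on the machine running the script.

```python
import re
from collections import Counter

# Linux errno values seen in the excerpts above (hardcoded on purpose;
# errno numbering differs between operating systems).
LINUX_ERRNO = {5: "EIO (I/O error)", 113: "EHOSTUNREACH (no route to host)"}

# Matches both client_bulk_callback() and server_bulk_callback() lines.
STATUS_RE = re.compile(
    r"(?:client|server)_bulk_callback\(\)\) event type \d+, status (-?\d+)"
)

def summarize_bulk_status(lines):
    """Count bulk callback status codes and label them with errno names."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    # Status values are negative errno codes, so flip the sign for lookup.
    return {LINUX_ERRNO.get(-code, f"errno {-code}"): n
            for code, n in counts.items()}

# Three lines copied from the logs above (two client, one OSS).
sample = [
    "Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff880080ec4000",
    "Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 1809:0:(events.c:198:client_bulk_callback()) event type 0, status -113, desc ffff8803c6658000",
    "Nov  8 09:21:18 n0002.lustre kernel: LustreError: 8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103be200000",
]

if __name__ == "__main__":
    for name, n in summarize_bulk_status(sample).items():
        print(f"{n:3d}  {name}")
```

Run against the full client and OSS logs, a tally like this would show whether the EHOSTUNREACH entries are isolated or recurring, which may help separate a routing problem from generic I/O failures.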
----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720



