[Lustre-discuss] Frequent OSS Crashes with heavy load

Jim Harm harm1 at llnl.gov
Thu Nov 13 15:53:41 PST 2008


We have experienced all of these errors when a big job is writing
many small chunks.
When the writes are, say, 80 bytes and the block size is 4 KB, the
back-end storage can slow down with read-block, modify-block,
write-block cycles to such an extent that it causes slow commitrw
and slow journal messages very similar to yours.
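
As a rough illustration of the read-modify-write cost described above,
here is a minimal Python sketch. The function name and the cost model
are assumptions made for illustration only: it simply charges a full
block read plus a full block write for every block that a write covers
only partially. It is not a description of how Lustre or the back-end
storage actually schedules I/O.

```python
def block_io_bytes(write_size, block_size, n_writes):
    """Bytes moved to/from storage when each small write is committed alone.

    Simplified model (an assumption, not Lustre's real I/O path): every
    block that a write only partially covers is read in full, modified,
    and written back in full; a fully covered block is only written.
    """
    moved = 0
    offset = 0
    for _ in range(n_writes):
        end = offset + write_size
        first = offset // block_size
        last = (end - 1) // block_size
        for blk in range(first, last + 1):
            blk_start = blk * block_size
            blk_end = blk_start + block_size
            fully_covered = offset <= blk_start and end >= blk_end
            moved += block_size if fully_covered else 2 * block_size
        offset = end
    return moved


# The figures from the message: 80-byte writes against 4 KB blocks.
payload = 80 * 1000
moved = block_io_bytes(write_size=80, block_size=4096, n_writes=1000)
print(f"payload {payload} bytes, storage traffic {moved} bytes, "
      f"amplification {moved / payload:.0f}x")
```

Under this model the storage traffic is roughly two full blocks per
80-byte write, which is why many tiny committed writes can stall the
back end even when the total payload is small.
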
From your email:
Dear all,
     This is a piece of error log:
     Nov 13 18:25:26 boss02 kernel: Lustre: 
27228:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 56 
previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 
27176:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) besfs-OST0004: 
slow journal start 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 
27231:0:(filter_io_26.c:713:filter_commitrw_write()) besfs-OST0004: 
slow brw_start 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 
27231:0:(filter_io_26.c:713:filter_commitrw_write()) Skipped 8 
previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 
27176:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) Skipped 10 
previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 
27278:0:(filter_io_26.c:765:filter_commitrw_write()) besfs-OST0004: 
slow direct_io 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 
27235:0:(lustre_fsfilt.h:302:fsfilt_commit_wait()) besfs-OST0004: 
slow journal start 47s

You may want to check for a job that is confirming each small write
instead of caching them and writing megabytes at a time.
We have even seen this phenomenon back up the server to the extent
that it appears to the client that it is time to try the failover
server, which fails.
Just something to check.
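
To make the small-write check above concrete, here is a minimal Python
sketch comparing a job that pushes every 80-byte record out on its own
with one that lets a large buffer collect records first. CountingSink
and count_backend_writes are hypothetical names; the sink is only an
in-memory stand-in for a file on an OST, and the 1 MB buffer size is an
assumed value, not a tuning recommendation.

```python
import io


class CountingSink(io.RawIOBase):
    """In-memory stand-in for the target file; counts writes that reach it."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)


def count_backend_writes(records, buffer_size, flush_each):
    """Return (write calls, bytes) that reach the sink for the given records."""
    sink = CountingSink()
    out = io.BufferedWriter(sink, buffer_size=buffer_size)
    for rec in records:
        out.write(rec)
        if flush_each:
            out.flush()  # push every small record through on its own
    out.flush()  # push out whatever is still buffered
    return sink.calls, sink.nbytes


records = [b"x" * 80] * 20000  # 20000 records of 80 bytes each
one_mb = 1024 * 1024

per_record = count_backend_writes(records, one_mb, flush_each=True)
buffered = count_backend_writes(records, one_mb, flush_each=False)
print("flush per record:", per_record)
print("buffered        :", buffered)
```

Both runs deliver the same number of bytes; the difference is how many
separate writes the back end has to absorb. An application that flushes
or syncs after every record defeats the buffering no matter how large
the buffer is.
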

At 8:58 AM -0500 11/13/08, Brian J. Murrell wrote:
>
>There is really no need to put both andreas and myself into your new
>message recipient addresses.  We are both on the lustre-discuss list.
>
>On Thu, 2008-11-13 at 19:32 +0800, wanglu wrote:
>>  <----I could not log in from SSH here and went to the console-->
>>  <---What I saw--->
>>  ...
>>  Nov 13 18:35:02 boss02 kernel: LustreError: 
>>26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial 
>>receive from 12345-192.168.52.79 at tcp, ip 192.168.52.79:1021, with 
>>error
>>  Nov 13 18:35:02 boss02 kernel: LustreError: 
>>26928:0:(events.c:361:server_bulk_callback()) event type 2, status 
>>-5, desc e1c24000
>>  Nov 13 18:35:02 boss02 kernel: LustreError: 
>>17941:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on 
>>bulk GET 0(1048576)  req at ea8cd200 x10376088/t0 
>>o4->b99b0138-d1de-93db-0418-c08eeb8c4b57 at NET_0x20000c0a8344f_UUID:0/0 
>>lens 384/352 e 0 to 0 dl 1226573467 ref 1 fl Interpret:/0/0 rc 0/0
> 
>^^^^^^^^^^^^^^^^^^
>>  Nov 13 18:35:02 boss02 kernel: Lustre: 
>>17941:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0001: 
>>ignoring bulk IO comm error with 
>>b99b0138-d1de-93db-0418-c08eeb8c4b57 at NET_0x20000c0a8344f_UUID id 
>>12345-192.168.52.79 at tcp - client will retry
>[ Many more ]
>>
>>  <---At that time, the network was down, couldn't ping gateway-->
>>  <--I have tried restart service network, but after restarted, 
>>gateway was still unreachable--->
>
>You have networking problems, not Lustre problems.  Lustre only utilizes
>whatever network you provide it.  It does not control it.  It does not
>bring it up, take it down or reconfigure it in any way.  Your operating
>system does this.
>
>b.
>
>
>_______________________________________________
>Lustre-discuss mailing list
>Lustre-discuss at lists.lustre.org
>http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 
}}}===============>>  LLNL
James E. Harm (Jim); jharm at llnl.gov
System Administrator, ICCD Clusters
(925) 422-4018 Page: 423-7705x57152


