<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/3.10.3">

</HEAD>

<BODY>

<BR>

Thanks Andreas.  I'll reconfigure the RAID and give it another shot today.  Would it be reasonable to attribute the stalled writes to this I/O mismatch?<BR>

<BR>

Dan<BR>

<BR>

<BR>

On Thu, 2008-01-31 at 01:40 -0700, Andreas Dilger wrote:

<BLOCKQUOTE TYPE=CITE>

<PRE>

<FONT COLOR="#000000">On Jan 30, 2008  18:32 -0800, Dan wrote:</FONT>

<FONT COLOR="#000000">> I was a little uncertain of the stripe size calculation so here we go...</FONT>

<FONT COLOR="#000000">> My chunk size is 128k and there are 23 disks in RAID 6 (one hot spare</FONT>

<FONT COLOR="#000000">> leaves 23).  That means 21 data disks?  Judging by your formula I take 23 *</FONT>

<FONT COLOR="#000000">> 128k which is 2944.  Is this even close to what you intended?  This stripe</FONT>

<FONT COLOR="#000000">> size hangs at mount...</FONT>


<FONT COLOR="#000000">Hmm, I don't think the mballoc code can efficiently deal with a stripe size </FONT>

<FONT COLOR="#000000">larger than the RPC size (which is 1MB) because this will always result in</FONT>

<FONT COLOR="#000000">a read-modify-write of the RAID stripe as not enough data can be collected</FONT>

<FONT COLOR="#000000">to fill a stripe.</FONT>


<FONT COLOR="#000000">> I've tried to test with the lustre-io kit but the tests (writes) fail on</FONT>

<FONT COLOR="#000000">> most OSTs.  That is the problem I'm having after all... frustrating.</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> Would it make sense to reconfigure the RAID controllers to have separate</FONT>

<FONT COLOR="#000000">> groups of disks in RAID 6?  For performance is there a recommended max</FONT>

<FONT COLOR="#000000">> size or number of disks for each OST?  Lastly, is it worthwhile to</FONT>

<FONT COLOR="#000000">> consider putting the ext3 journal on another device exported from the RAID</FONT>

<FONT COLOR="#000000">> controller?</FONT>


<FONT COLOR="#000000">Having 21 disks in the RAID set is probably too large to be practical</FONT>

<FONT COLOR="#000000">because of the high overhead of doing IO of such a large size.</FONT>

<FONT COLOR="#000000">Good configurations for such a system might be 2x 8+2 + spare = 21 disks</FONT>

<FONT COLOR="#000000">with 128kB chunk size, or 16+2 + spare = 19 disks with 64kB chunk size.</FONT>

<FONT COLOR="#000000">Both result in 1MB full stripe size, which is what mballoc and Lustre</FONT>

<FONT COLOR="#000000">are optimized to by default.</FONT>


<FONT COLOR="#000000">> > On Jan 18, 2008  16:45 -0800, Dan wrote:</FONT>

<FONT COLOR="#000000">> >>     I'm looking for some advice on improving disk performance and</FONT>

<FONT COLOR="#000000">> >> understanding what Lustre is doing with it.  Right now I have a ~28 TB</FONT>

<FONT COLOR="#000000">> >> OSS with 4 OSTs on it.  There are 4 clients using Lustre native - no</FONT>

<FONT COLOR="#000000">> >> NFS.  If I write to the lustre volume from the clients I get odd</FONT>

<FONT COLOR="#000000">> >> behavior.  Typically the writes have a long pause before any data</FONT>

<FONT COLOR="#000000">> >> starts hitting the disks.  Then 2 or 3 of the clients will write</FONT>

<FONT COLOR="#000000">> >> happily but one or two will not.  Eventually Lustre will pump out a</FONT>

<FONT COLOR="#000000">> >> number of I/O related errors such as "slow i_mutex 165 seconds, slow</FONT>

<FONT COLOR="#000000">> >> direct_io 32 seconds" and so on.  Next the clients that couldn't write</FONT>

<FONT COLOR="#000000">> >> will catch up and pass the clients that could write.  At some point (5</FONT>

<FONT COLOR="#000000">> >> minutes or so) the jobs start failing without any errors.  New jobs</FONT>

<FONT COLOR="#000000">> >> can be started after these fail and the pattern repeats.  Write speeds</FONT>

<FONT COLOR="#000000">> >> are low, around 22 MB/sec per client; the disks shouldn't have any</FONT>

<FONT COLOR="#000000">> >> problem handling 4 writes at this speed!!  This did work using NFS.</FONT>

<FONT COLOR="#000000">> >></FONT>

<FONT COLOR="#000000">> >>     When these disks were formatted with XFS, I/O was fast.  No problems at</FONT>

<FONT COLOR="#000000">> >> all writing 475 MB/sec sustained per RAID controller (locally, not via</FONT>

<FONT COLOR="#000000">> >> NFS).  No delays.  After configuring for Lustre the peak sustained</FONT>

<FONT COLOR="#000000">> >> write (locally) is 230 MB/sec.  It will write for about 2 minutes</FONT>

<FONT COLOR="#000000">> >> before logging about slow I/O.  This is without any clients connected.</FONT>

<FONT COLOR="#000000">> >></FONT>

<FONT COLOR="#000000">> >> So far I've done the following:</FONT>

<FONT COLOR="#000000">> >></FONT>

<FONT COLOR="#000000">> >> 1.  Recompiled SCSI driver for RAID controller to use 1 MB blocks (from</FONT>

<FONT COLOR="#000000">> >> 256k).</FONT>

<FONT COLOR="#000000">> >> 2.  Adjusted MDS, OST threads</FONT>

<FONT COLOR="#000000">> >> 3.  Tried all I/O schedulers</FONT>

<FONT COLOR="#000000">> >> 4.  Tried all possible settings on RAID controllers for Caching and</FONT>

<FONT COLOR="#000000">> >> read-ahead.</FONT>

<FONT COLOR="#000000">> >> 5.  Some minor stuff I forgot about!</FONT>

<FONT COLOR="#000000">> >></FONT>

<FONT COLOR="#000000">> >> Nothing makes a difference - same results under each configuration except</FONT>

<FONT COLOR="#000000">> >> for schedulers.  When running the deadline scheduler the writes fail</FONT>

<FONT COLOR="#000000">> >> faster and have delays around 30 seconds.  With all others the delays</FONT>

<FONT COLOR="#000000">> >> range from 100 to 500 seconds.</FONT>

<FONT COLOR="#000000">> >></FONT>

<FONT COLOR="#000000">> >> The system has 4 cores and 4 GB of memory with four 7 TB OSTs.  The disks are</FONT>

<FONT COLOR="#000000">> >> in RAID 6 split between two controllers with 2 GB cache each.  One</FONT>

<FONT COLOR="#000000">> >> controller has the MGS/MDT on it.  When running top it indicates 2/3 to</FONT>

<FONT COLOR="#000000">> >> 3/4 of memory utilized and 25% CPU utilization normally.</FONT>

<FONT COLOR="#000000">> ></FONT>

<FONT COLOR="#000000">> > Are you using Lustre 1.4 or 1.6?  Are you mounting your OSTs with</FONT>

<FONT COLOR="#000000">> > "-o extents,mballoc"?  We've had Lustre OSS nodes running in excess</FONT>

<FONT COLOR="#000000">> > of 2GB/s with h/w RAID controllers.</FONT>

<FONT COLOR="#000000">> ></FONT>

<FONT COLOR="#000000">> > Are you using partitions on your RAID device?  You shouldn't - that causes</FONT>

<FONT COLOR="#000000">> > unaligned IO to the device and needless read-modify-write for each RAID</FONT>

<FONT COLOR="#000000">> > stripe.</FONT>

<FONT COLOR="#000000">> ></FONT>

<FONT COLOR="#000000">> > Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)?  If not,</FONT>

<FONT COLOR="#000000">> > then you should consider mounting your OSTs with "-o stripe={raid_stripe}",</FONT>

<FONT COLOR="#000000">> > where raid_stripe=N*raid_chunksize, N is the number of data disks for</FONT>

<FONT COLOR="#000000">> > RAID 5 N+1 or RAID 6 N+2.</FONT>

<FONT COLOR="#000000">> ></FONT>

<FONT COLOR="#000000">> > You should download the lustre-iokit and use sgpdd-survey, obdfilter-survey,</FONT>

<FONT COLOR="#000000">> > and PIOS to determine what is causing the performance bottleneck.</FONT>

<FONT COLOR="#000000">> ></FONT>

<FONT COLOR="#000000">> > Cheers, Andreas</FONT>

<FONT COLOR="#000000">> > --</FONT>

<FONT COLOR="#000000">> > Andreas Dilger</FONT>

<FONT COLOR="#000000">> > Sr. Staff Engineer, Lustre Group</FONT>

<FONT COLOR="#000000">> > Sun Microsystems of Canada, Inc.</FONT>

<FONT COLOR="#000000">> ></FONT>



<FONT COLOR="#000000">Cheers, Andreas</FONT>

<FONT COLOR="#000000">--</FONT>

<FONT COLOR="#000000">Andreas Dilger</FONT>

<FONT COLOR="#000000">Sr. Staff Engineer, Lustre Group</FONT>

<FONT COLOR="#000000">Sun Microsystems of Canada, Inc.</FONT>

</PRE>

</BLOCKQUOTE>
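<BR>
For reference, the stripe arithmetic from the quoted thread can be sketched as below.  This is only an illustration, not anything taken from Lustre or mballoc: the names full_stripe_kb and rpc_aligned are made up here, and the alignment test (1MB being a whole multiple of the full stripe) is an assumed reading of "efficient with 1MB IOs".  The geometries and the 1MB RPC size are the ones mentioned in the thread; only data disks are counted, so 21 * 128k is used rather than 23 * 128k.<BR>
<PRE>
```python
RPC_SIZE_KB = 1024  # 1MB RPC size, as stated in the thread


def full_stripe_kb(data_disks, chunk_kb):
    """Full RAID stripe in kB: data disks only (no parity, no spares) times chunk size."""
    return data_disks * chunk_kb


def rpc_aligned(data_disks, chunk_kb, rpc_kb=RPC_SIZE_KB):
    """Assumed check: one RPC covers a whole number of full stripes."""
    return rpc_kb % full_stripe_kb(data_disks, chunk_kb) == 0


# (label, data disks, chunk size in kB) for the layouts discussed in the thread
geometries = [
    ("21+2 with 128k chunks", 21, 128),
    ("8+2 with 128k chunks", 8, 128),
    ("16+2 with 64k chunks", 16, 64),
]

for label, data_disks, chunk_kb in geometries:
    stripe = full_stripe_kb(data_disks, chunk_kb)
    print(label, stripe, "kB, aligned to 1MB RPC:", rpc_aligned(data_disks, chunk_kb))
```
</PRE>
Under these assumptions the 21-data-disk layout gives a 2688 kB stripe that a single 1MB RPC cannot fill, while both suggested layouts give exactly 1024 kB.<BR>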

</BODY>

</HTML>