[Lustre-discuss] mkfs options/tuning for RAID based OSTs

Edward Walter ewalter at cs.cmu.edu
Wed Oct 20 07:19:17 PDT 2010


Hi Brian,

Thanks for the clarification.  It didn't click for me that the optimal 
data size is exactly 1MB...  Everything you're saying makes sense though. 
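
To make the arithmetic concrete: with a 128 KB RAID chunk, an 8+2 array 
holds 8 x 128 KB = 1 MB of data per stripe, so every 1 MB Lustre RPC 
lands as exactly one full-stripe write.  With 10 data disks nothing 
lines up: 1 MB / 10 = 102.4 KB isn't a usable chunk size, and a 64 KB 
chunk gives a 640 KB stripe, which doesn't divide evenly into 1 MB.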

Obviously with 12-disk arrays there's tension between maximizing space 
and maximizing performance, and I was hoping to get the best of both. 
The difference between 10 data + 2 parity and 4+2 or 8+2 works out to 2 
fewer data disks (4 TB) per shelf for us, or 24 TB in total, which is 
why I was trying to figure out how to make this work with more data 
disks.
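
If we do end up going with 8+2 and a 128 KB chunk, my understanding is 
that we'd pass the ldiskfs stride/stripe-width hints through 
mkfs.lustre roughly like this (assuming 4 KB filesystem blocks; the 
fsname, MGS NID and device below are placeholders, not our real 
values):

  # 128 KB chunk / 4 KB block = 32 blocks per chunk (stride)
  # 32 blocks x 8 data disks  = 256 blocks per full stripe (stripe_width)
  mkfs.lustre --ost --fsname=testfs --mgsnode=192.168.0.10@tcp \
      --mkfsoptions="-E stride=32,stripe_width=256" /dev/sdX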

Thanks to everyone for the input.  This has been very helpful.

-Ed

Brian J. Murrell wrote:
> On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote: 
>   
>
> Ed,
>
>   
>> That seems to validate how I'm interpreting the parameters. We have
>> 10 data disks and 2 parity disks per array so it looks like we need
>> to be at 64 KB or less.
>>     
>
> I think you have been missing everyone's point in this thread.  The
> magic value is not "anything below 1MB"; it's 1MB exactly.  No more, no
> less (although technically 256KB or 512KB would also work, since a 1MB
> write then maps onto exactly 4 or 2 full stripes).
>
> The reason is that Lustre attempts to package up I/Os from the client
> to the OST in 1MB chunks.  If the RAID stripe matches that 1MB, then
> when the OSS writes that 1MB to the OST, it's a single full-stripe
> write to the RAID array underlying the OST: 1MB of data plus the
> parity, with no reads needed.
>
> Conversely, if the OSS receives 1MB of data for the OST and the RAID
> stripe under the OST is not 1MB but less, then the first
> <raid_stripe_size> of data is written as data+parity to the first
> stripe, but the remaining 1MB - <raid_stripe_size> of data from the
> client only partially fills the next RAID stripe.  That forces the
> RAID layer to first read that whole stripe, insert the new data,
> calculate a new parity, and then write the whole RAID stripe back out
> to disk (a read-modify-write cycle).
>
> So as you can see, when your RAID stripe is not exactly 1MB, the RAID
> code has to do a lot more I/O, which obviously impacts performance.
>
> This is why the recommendations in this thread have consistently been
> to use a number of data disks that divides evenly into 1MB (i.e.
> powers of 2: 2, 4, 8, etc.).  So for RAID6: 4+2 or 8+2, etc.
>
> b.
>
>   


