[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Edward Walter
ewalter at cs.cmu.edu
Wed Oct 20 07:19:17 PDT 2010
Hi Brian,
Thanks for the clarification. It didn't click that the optimal I/O
size is exactly 1MB... Everything you're saying makes sense though.
Obviously, with 12-disk arrays there's tension between maximizing space
and maximizing performance, and I was hoping to get the best of both.
The difference between 10 data + 2 parity and 4+2 or 8+2 works out to
2 data disks (4 TB) per shelf for us, or 24 TB in total, which is why
I was trying to figure out how to make this work with more data disks.
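For what it's worth, the filesystem-side alignment math for one of the
recommended geometries can be sketched like this (the 8+2 layout, 128 KB
per-disk chunk, and 4 KB block size here are illustrative assumptions,
not values from this thread):

```python
# Sketch of the ldiskfs/ext4 alignment math for a RAID6 OST.
# Assumed geometry (for illustration only): 8 data + 2 parity disks,
# a 128 KB per-disk chunk, and 4 KB filesystem blocks.
DATA_DISKS = 8
CHUNK_KB = 128          # per-disk RAID chunk size
BLOCK_KB = 4            # ext4/ldiskfs block size

stripe_kb = DATA_DISKS * CHUNK_KB    # full data stripe
stride = CHUNK_KB // BLOCK_KB        # -E stride= (fs blocks per chunk)
stripe_width = DATA_DISKS * stride   # -E stripe_width= (blocks/stripe)

print(f"data stripe     : {stripe_kb} KB")   # 1024 KB == 1 MB, aligned
print(f"-E stride       : {stride}")         # 32
print(f"-E stripe_width : {stripe_width}")   # 256
```

With those numbers, the format command would presumably carry something
like --mkfsoptions='-E stride=32,stripe_width=256' so the filesystem
allocator stays aligned to the RAID geometry.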
Thanks to everyone for the input. This has been very helpful.
-Ed
Brian J. Murrell wrote:
> On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote:
>
>
> Ed,
>
>
>> That seems to validate how I'm interpreting the parameters. We have 10 data disks and 2 parity disks per array so it looks like we need to be at 64 KB or less.
>>
>
> I think you have been missing everyone's point in this thread. The
> magic value is not "anything below 1MB", it's 1MB exactly. No more, no
> less (although I guess technically 256KB or 512KB would work).
>
> The reason is that Lustre packages up I/O from the client to the
> OST in 1MB chunks. If the RAID stripe size matches that 1MB, then
> when the OSS writes that 1MB to the OST, it's a single full-stripe
> write of 1MB of data plus parity to the RAID device underlying the
> OST.
>
> Conversely, if the OSS receives 1MB of data for the OST and the RAID
> stripe under the OST is smaller than 1MB, then the first
> <raid_stripe_size> of data plus parity fills the first stripe
> completely, but the remaining 1MB-<raid_stripe_size> of the client's
> data only partially fills the next RAID stripe. That forces the RAID
> layer to first read that whole stripe, insert the new data, calculate
> a new parity, and then write the whole stripe back out to disk.
>
> So as you can see, when your RAID stripe is not exactly 1MB, the
> RAID code has to do considerably more I/O, which obviously hurts
> performance.
>
> This is why the recommendation in this thread has consistently been
> to use a number of data disks that divides evenly into 1MB (i.e. a
> power of 2: 2, 4, 8, etc.). So for RAID6: 4+2 or 8+2, etc.
>
> b.
>
>
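Brian's read-modify-write argument can be sketched numerically. This is
a toy model (not Lustre code) that treats each 1 MB RPC as an
independent write and counts RAID stripes that are only partially
covered, each of which costs a read-modify-write cycle:

```python
# Toy model of RAID6 stripe alignment: count partial-stripe writes
# when streaming 1 MB Lustre RPCs onto different data-stripe sizes.
# All sizes are in KB; each RPC is modeled as an independent I/O.
MB = 1024

def partial_stripe_writes(rpc_kb, stripe_kb, n_rpcs):
    """Count stripes only partially covered by sequential RPCs."""
    partials = 0
    offset = 0
    for _ in range(n_rpcs):
        end = offset + rpc_kb
        # a stripe write is partial if the I/O does not start or end
        # on a stripe boundary
        if offset % stripe_kb:
            partials += 1
        if end % stripe_kb:
            partials += 1
        offset = end
    return partials

# 8 data disks x 128 KB chunk = 1 MB stripe: every RPC is aligned.
print(partial_stripe_writes(MB, 8 * 128, 100))   # → 0
# 10 data disks x 64 KB chunk = 640 KB stripe: nearly every RPC
# straddles a stripe boundary, triggering read-modify-write cycles.
print(partial_stripe_writes(MB, 10 * 64, 100))
```

In this model the 10-disk geometry generates partial-stripe writes on
nearly every RPC, while the 8-disk geometry generates none, which is
the trade-off against the extra 24 TB of capacity.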