[Lustre-discuss] mkfs options/tuning for RAID based OSTs

Wojciech Turek wjt27 at cam.ac.uk
Wed Oct 20 09:50:28 PDT 2010


Hi Edward,

As Andreas mentioned earlier, the maximum OST size is 16 TB if one uses
ext4-based ldiskfs. Creating a RAID group bigger than that will
definitely hurt your performance, because you would have to split the
large array into smaller logical disks, which randomises I/O on the
RAID controller. With 2 TB disks, RAID6 is the way to go, as the
rebuild time for a failed disk is quite long, which raises the chance
of a double disk failure to an uncomfortable level. Taking that into
consideration, I think 8+2 RAID6 with a 128 KB segment size is the
right choice (8 data disks x 128 KB = 1 MB, a full stripe that matches
Lustre's 1 MB writes). Spare disks can be used as hot spares or for an
external journal.
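
For example, something like the following would tell ldiskfs about the
RAID geometry at format time (just a sketch: the fsname, MGS NID and
device names below are placeholders, so adjust them for your setup):

  # 8 data disks, 128 KB segment size, 4 KB ldiskfs blocks:
  #   stride       = 128 KB / 4 KB       = 32 blocks
  #   stripe_width = 32 blocks * 8 disks = 256 blocks (= 1 MB)
  mkfs.lustre --ost --fsname=testfs --mgsnode=mgs01@tcp0 \
      --mkfsoptions="-E stride=32,stripe_width=256" /dev/sdb

  # If a spare disk is used for an external journal instead, format it
  # as a journal device and point the OST's ldiskfs at it:
  mke2fs -b 4096 -O journal_dev /dev/sdc
  mkfs.lustre --ost --fsname=testfs --mgsnode=mgs01@tcp0 \
      --mkfsoptions="-E stride=32,stripe_width=256 -j -J device=/dev/sdc" \
      /dev/sdb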


On 20 October 2010 15:19, Edward Walter <ewalter at cs.cmu.edu> wrote:

> Hi Brian,
>
> Thanks for the clarification.  It didn't click that the optimal data
> size is exactly 1MB...  Everything you're saying makes sense though.
>
> Obviously, with 12-disk arrays there's tension between maximizing space
> and maximizing performance.  I was hoping/trying to get the best of
> both.  The difference between doing 10 data and 2 parity vs 4+2 or 8+2
> works out to a difference of 2 data disks (4 TB) per shelf for us, or
> 24 TB in total, which is why I was trying to figure out how to make
> this work with more data disks.
>
> Thanks to everyone for the input.  This has been very helpful.
>
> -Ed
>
> Brian J. Murrell wrote:
> > On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote:
> >
> >
> > Ed,
> >
> >
> >> That seems to validate how I'm interpreting the parameters.  We have
> >> 10 data disks and 2 parity disks per array, so it looks like we need
> >> to be at 64 KB or less.
> >>
> >
> > I think you have been missing everyone's point in this thread.  The
> > magic value is not "anything below 1MB", it's 1MB exactly.  No more, no
> > less (although I guess technically 256KB or 512KB would work).
> >
> > The reason is that Lustre attempts to package up I/Os from the client
> > to the OST in 1MB chunks.  If the RAID stripe matches that 1MB, then
> > when the OSS writes that 1MB to the OST, it's a single full-stripe
> > write to the RAID array underlying the OST of 1MB of data plus the
> > parity.
> >
> > Conversely, if the OSS receives 1MB of data for the OST and the RAID
> > stripe under the OST is not 1MB but less, then the first
> > <raid_stripe_size> will be written as data+parity to the first stripe,
> > but the remaining 1MB - <raid_stripe_size> of data from the client
> > will be written into the next RAID stripe, only partially filling it.
> > That forces the RAID layer to first read that whole stripe, insert
> > the new data, calculate a new parity and then write the whole RAID
> > stripe back out to the disk.
> >
> > So as you can see, when your RAID stripe is not exactly 1MB, the RAID
> > code has to do a lot more I/O, which impacts performance, obviously.
> >
> > This is why the recommendations in this thread have continued to be
> > using a number of data disks that divides evenly into 1MB (i.e. powers
> > of 2: 2, 4, 8, etc.).  So for RAID6: 4+2 or 8+2, etc.
> >
> > b.
> >
> >
>