[Lustre-discuss] RAID cards - what works well with Lustre?

Daire Byrne daire.byrne at gmail.com
Mon Jul 18 07:54:33 PDT 2011


Thanks for the replies and insight. I played around with various
sg_tablesize and max_hw_sectors_kb values, but it didn't seem to make much
measurable difference to the overall performance (using obdfilter-survey).
I suspect the RAM cache on these cards smooths things out even though the
IOs actually reaching the disks are rarely 1MB.

I'll test an Adaptec card too, but we don't really want to run 4+2 and lose
so many disks to RAID6 parity, so I'll see how it performs with 8+2.
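
For the 8+2 arrays I'd also like to pass the RAID geometry down to ldiskfs
when formatting the OSTs. A rough sketch of what I have in mind (untested;
the 128KB chunk size, fsname, MGS nid and device below are placeholders):

  # 8 data disks x 128KB chunk = 1MB full stripe
  # stride = 128KB / 4KB blocks = 32; stripe-width = 8 x 32 = 256
  mkfs.lustre --ost --fsname=testfs --mgsnode=192.168.1.10@tcp0 \
      --mkfsoptions="-E stride=32,stripe-width=256" /dev/sdb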

Thanks again,

Daire

On Tue, Jul 5, 2011 at 6:58 PM, Charles Taylor <taylor at hpc.ufl.edu> wrote:

> We use Adaptec 51245s and 51645s with:
>
> 1. max_hw_sectors_kb=512
> 2. RAID5 4+1 or RAID6 4+2
> 3. RAID chunk size = 128KB
>
> With 4 data disks and a 128KB chunk, the full stripe is 512KB, so each 1MB
> Lustre RPC results in two 4-way full-stripe writes with no
> read-modify-write penalty.  We can further improve write performance by
> matching max_pages_per_rpc (per OST on the client side), i.e. the max RPC
> size, to the max_hw_sectors_kb setting for the block devices.  In this case
>
> max_pages_per_rpc=128
>
> instead of the default 256, so that 128 pages x 4KB = 512KB and you get one
> full-stripe write per RPC.
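>
> A minimal sketch of how those two knobs get set (the device name is
> illustrative, and note that on stock kernels the writable limit is
> max_sectors_kb, bounded above by max_hw_sectors_kb):
>
> # on the OSS: hardware ceiling, then the writable per-request limit (512KB)
> cat /sys/block/sdb/queue/max_hw_sectors_kb
> echo 512 > /sys/block/sdb/queue/max_sectors_kb
>
> # on the clients: 128 pages x 4KB = 512KB per RPC, i.e. one full RAID stripe
> lctl set_param osc.*.max_pages_per_rpc=128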
>
> If you put your OSTs atop LVs (LVM2) as we do, you will want to take the
> additional step of making sure your LVs are aligned as well.
>
> pvcreate --dataalignment 1024S /dev/sd$driveChar
>
> You need a fairly new version of LVM2 that supports the --dataalignment
> option.  We are using lvm2-2.02.56-8.el5_5.6.x86_64.
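>
> A quick way to double-check the resulting alignment (device name is a
> placeholder):
>
> # 1024 x 512-byte sectors = 512KB; pe_start should land on that boundary
> pvs -o +pe_start /dev/sdb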
>
> Note that we attempted to increase max_hw_sectors_kb for the block devices
> (RAID LDs) to 1024, but to do so we needed to set the Adaptec driver
> (aacraid) kernel parameter acbsize=8192, which we found to be unstable.
> For our Adaptec drivers we use:
>
> options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096
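>
> That options line goes in /etc/modprobe.conf on EL5 (or a file under
> /etc/modprobe.d/), and the values the loaded driver actually picked up
> should be visible in sysfs, e.g.:
>
> cat /sys/module/aacraid/parameters/acbsize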
>
> Note that most of the information above was the result of testing and
> tuning performed here by Craig Prescott.
>
> We now have close to a PB of such storage in production here at the UF HPC
> Center.  We used Areca cards at first but found them to be a bit too flaky
> for our needs.  The Adaptecs seem to have some infant-mortality issues: we
> RMA about 10% to 12% of newly purchased cards, but if they make it past
> initial burn-in testing they tend to be pretty reliable.
>
> Regards,
>
> Charlie Taylor
> UF HPC Center
>
> On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:
>
> Hi,
>
> I have been testing some LSI 9260 RAID cards for use with Lustre v1.8.6, but
> have found that the "megaraid_sas" driver is not really able to deliver the
> 1MB full-stripe IOs that Lustre likes. This topic has also come up recently
> in the following two email threads:
>
>
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
>
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab
>
> I was able to raise max_hw_sectors_kb to 1024 by setting the "max_sectors"
> megaraid_sas module option, but found that the IOs were still pretty
> fragmented:
>
>                            read      |      write
> disk I/O size          ios   % cum % |  ios   % cum %
> 4K:                   3060   0   0   | 2611   0   0
> 8K:                   3261   0   0   | 2664   0   0
> 16K:                  6408   0   1   | 5296   0   1
> 32K:                 13025   1   2   | 10692   1   2
> 64K:                 48397   4   6   | 26417   2   4
> 128K:                50166   4  10   | 42218   4   9
> 256K:               113124   9  20   | 86516   8  17
> 512K:               677242  57  78   | 448231  45  63
> 1M:                 254195  21 100   | 355804  36 100
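>
> For reference, those histograms are the "disk I/O size" section of the
> obdfilter brw_stats, and the driver limit was raised with a module option
> along these lines (the exact max_sectors value, 2048 x 512-byte sectors =
> 1024KB, is illustrative):
>
> options megaraid_sas max_sectors=2048
> lctl get_param obdfilter.*.brw_stats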
>
> So next I looked at sg_tablesize and found it was being set to "80" by the
> driver (which queries the firmware). I tried hacking the driver to increase
> this value, but bad things happened, so it looks like a genuine hardware
> limit with these cards.
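>
> The limit the HBA reports can be read back without patching anything, e.g.
> (the host number is a placeholder):
>
> cat /sys/class/scsi_host/host0/sg_tablesize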
>
> The overall throughput isn't exactly terrible, because the RAID write-back
> cache does a reasonable job, but I suspect it could be better, e.g.
>
> ost  3 sz 201326592K rsz 1024K obj  192 thr  192 write 1100.52 [ 231.75, 529.96] read  940.26 [ 275.70, 357.60]
> ost  3 sz 201326592K rsz 1024K obj  192 thr  384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
> ost  3 sz 201326592K rsz 1024K obj  192 thr  768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
> ost  3 sz 201326592K rsz 1024K obj  384 thr  384 write  920.87 [ 171.82, 466.77] read  901.03 [ 257.73, 372.87]
> ost  3 sz 201326592K rsz 1024K obj  384 thr  768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]
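>
> Those numbers came from obdfilter-survey in local disk mode, invoked roughly
> as follows (the size/object/thread ranges here are approximations of the
> sweep above):
>
> size=65536 rszlo=1024 rszhi=1024 nobjlo=64 nobjhi=128 \
>     thrlo=64 thrhi=256 case=disk sh obdfilter-survey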
>
> All of this brings me to my main question - what internal RAID cards have
> people here used that work well with Lustre?  3ware, Areca or other models
> of LSI?
>
> Cheers,
>
> Daire
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss