[Lustre-discuss] RAID cards - what works well with Lustre?

Charles Taylor taylor at hpc.ufl.edu
Tue Jul 5 10:58:23 PDT 2011


We use Adaptec 51245s and 51645s with

1. max_hw_sectors_kb=512
2. RAID5 4+1 or RAID6 4+2 (four data disks either way)
3. RAID chunk size = 128 KB

With four data disks and a 128 KB chunk, the full stripe is 512 KB, so
each 1 MB Lustre RPC results in two 4-way, full-stripe writes with no
read-modify-write penalty.  We can further improve write performance
by matching max_pages_per_rpc (set per OST on the client side), i.e.
the maximum RPC size, to the max_hw_sectors_kb setting of the block
devices.  In this case

max_pages_per_rpc=128

instead of the default 256, at which point each RPC is exactly one
full-stripe write.
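
For completeness, a minimal sketch of one way to check and apply those
two settings; the device name /dev/sdb and the wildcarded OSC pattern
are just examples, substitute your own:

  # OSS side: confirm the 512 KB limit the RAID LD advertises
  cat /sys/block/sdb/queue/max_hw_sectors_kb

  # client side: shrink the RPC size to one full stripe (128 pages x 4 KB = 512 KB)
  lctl set_param osc.*.max_pages_per_rpc=128
  lctl get_param osc.*.max_pages_per_rpc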

If you put your OSTs atop LVs (LVM2) as we do, you will want to take  
the additional step of making sure your LVs are aligned as well.

pvcreate --dataalignment 1024S /dev/sd$driveChar

You need a fairly new version of LVM2 that supports the --dataalignment
option.  We are using lvm2-2.02.56-8.el5_5.6.x86_64.
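
The 1024S above is 1024 512-byte sectors, i.e. 512 KB, one full RAID
stripe.  One way to verify the alignment afterwards (same $driveChar
naming as above; the pe_start column is the thing to look at):

  # pe_start is where LVM begins laying out data on the PV;
  # it should land on a 512 KB boundary
  pvs -o +pe_start --units k /dev/sd$driveChar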

Note that we attempted to increase max_hw_sectors_kb for the block
devices (RAID LDs) to 1024, but doing so required setting the Adaptec
driver (aacraid) module parameter acbsize=8192, which we found to be
unstable.  For our Adaptec controllers we use

options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096
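
That line goes in modprobe's configuration (typically /etc/modprobe.conf
on EL5, or a file under /etc/modprobe.d/ on newer distros).  To
sanity-check which parameters the driver accepts and what a loaded
module exposes, something along these lines should work, assuming the
aacraid module is installed:

  # list the parameters the installed aacraid module accepts
  modinfo -p aacraid
  # parameters the running module exposes via sysfs, if any
  grep . /sys/module/aacraid/parameters/* 2>/dev/null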

Note that most of the information above was the result of testing and  
tuning performed here by Craig Prescott.

We now have close to a PB of such storage in production here at the UF
HPC Center.  We used Areca cards at first but found them to be a bit
too flaky for our needs.  The Adaptecs seem to have some infant
mortality issues; we RMA about 10% to 12% of newly purchased cards,
but if they make it past initial burn-in testing, they tend to be
pretty reliable.

Regards,

Charlie Taylor
UF HPC Center

On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:

> Hi,
>
> I have been testing some LSI 9260 RAID cards for use with Lustre  
> v1.8.6 but have found that the "megaraid_sas" driver is not really  
> able to facilitate the 1MB full stripe IOs that Lustre likes. This  
> topic has also come up recently in the following two email threads:
>
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab
>
> I was able to up the max_hw_sectors_kb -> 1024 by setting the  
> "max_sectors" megaraid_sas module option but found that the IOs were  
> still being pretty fragmented:
>
>                        read          |  write
> disk I/O size          ios   % cum % |  ios   % cum %
> 4K:                   3060   0   0   | 2611   0   0
> 8K:                   3261   0   0   | 2664   0   0
> 16K:                  6408   0   1   | 5296   0   1
> 32K:                 13025   1   2   | 10692   1   2
> 64K:                 48397   4   6   | 26417   2   4
> 128K:                50166   4  10   | 42218   4   9
> 256K:               113124   9  20   | 86516   8  17
> 512K:               677242  57  78   | 448231  45  63
> 1M:                 254195  21 100   | 355804  36 100
>
> So next I looked at the sg_tablesize and found it was being set to  
> "80" by the driver (which queries the firmware). I tried to hack the  
> driver and increase this value but bad things happened and so it  
> looks like it is a genuine hardware limit with these cards.
>
> The overall throughput isn't exactly terrible because the RAID
> write-back cache does a reasonable job but I suspect it could be
> better, e.g.
>
> ost  3 sz 201326592K rsz 1024K obj  192 thr  192 write 1100.52 [ 231.75, 529.96] read  940.26 [ 275.70, 357.60]
> ost  3 sz 201326592K rsz 1024K obj  192 thr  384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
> ost  3 sz 201326592K rsz 1024K obj  192 thr  768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
> ost  3 sz 201326592K rsz 1024K obj  384 thr  384 write  920.87 [ 171.82, 466.77] read  901.03 [ 257.73, 372.87]
> ost  3 sz 201326592K rsz 1024K obj  384 thr  768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]
>
> All of this brings me to my main question - what internal cards have  
> people here used which work well with Lustre?  3ware, Areca or other  
> models of LSI?
>
> Cheers,
>
> Daire
