[Lustre-discuss] RAID cards - what works well with Lustre?
Charles Taylor
taylor at hpc.ufl.edu
Tue Jul 5 10:58:23 PDT 2011
We use Adaptec 51245s and 51645s with:
1. max_hw_sectors_kb = 512
2. RAID5 4+1 or RAID6 4+2
3. RAID chunk size = 128 KB
With a 128 KB chunk across 4 data disks, the full RAID stripe is 512 KB,
so each 1 MB Lustre RPC results in two 4-way, full-stripe writes with no
read-modify-write penalty. We can further improve write performance by
matching max_pages_per_rpc (set per OST on the client side), i.e. the
maximum RPC size, to the max_hw_sectors_kb setting of the block devices.
In this case that means
max_pages_per_rpc=128
(128 pages x 4 KB = 512 KB) instead of the default 256, at which point
each RPC is exactly one full-stripe write.
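For anyone wanting to try that, something along these lines on a client
should do it (assuming the standard lctl tunable; the osc wildcard below
is just illustrative, narrow it if you only want particular OSTs):
lctl set_param osc.*.max_pages_per_rpc=128    # 128 pages x 4 KB = 512 KB per RPC
lctl get_param osc.*.max_pages_per_rpc        # verify the new value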
If you put your OSTs atop LVs (LVM2) as we do, you will want to take
the additional step of making sure your LVs are aligned as well:
pvcreate --dataalignment 1024S /dev/sd$driveChar
Here 1024 sectors = 512 KB, i.e. one full RAID stripe. You need a fairly
new version of LVM2 that supports the --dataalignment option. We are
using lvm2-2.02.56-8.el5_5.6.x86_64.
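To sanity-check the alignment afterwards, something like the following
should show where the data area starts on each PV (the pe_start field
again assumes a reasonably recent LVM2):
pvs -o +pe_start /dev/sd$driveChar    # "1st PE" offset should be a multiple of 512 KB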
Note that we attempted to increase max_hw_sectors_kb for the block
devices (RAID LDs) to 1024, but in order to do so we needed to change
the Adaptec driver (aacraid) kernel parameter to acbsize=8192, which we
found to be unstable. For our Adaptec controllers we use:
options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096
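(That line lives in /etc/modprobe.conf or under /etc/modprobe.d/ on EL5
and takes effect the next time the module loads.) If you want to confirm
what the controller actually reports, something like this should do,
with sdX standing in for one of the RAID LDs:
cat /sys/block/sdX/queue/max_hw_sectors_kb    # expect 512 with acbsize=4096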
Note that most of the information above was the result of testing and
tuning performed here by Craig Prescott.
We now have close to a PB of such storage in production here at the UF
HPC Center. We used Areca cards at first but found them to be a bit
too flaky for our needs. The Adaptecs seem to have some infant-mortality
issues; we RMA about 10% to 12% of newly purchased cards, but if they
make it past initial burn-in testing, they tend to be pretty reliable.
Regards,
Charlie Taylor
UF HPC Center
On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:
> Hi,
>
> I have been testing some LSI 9260 RAID cards for use with Lustre
> v1.8.6 but have found that the "megaraid_sas" driver is not really
> able to facilitate the 1MB full stripe IOs that Lustre likes. This
> topic has also come up recently in the following two email threads:
>
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab
>
> I was able to up the max_hw_sectors_kb -> 1024 by setting the
> "max_sectors" megaraid_sas module option but found that the IOs were
> still being pretty fragmented:
>
> disk I/O size ios % cum % | ios % cum %
> 4K: 3060 0 0 | 2611 0 0
> 8K: 3261 0 0 | 2664 0 0
> 16K: 6408 0 1 | 5296 0 1
> 32K: 13025 1 2 | 10692 1 2
> 64K: 48397 4 6 | 26417 2 4
> 128K: 50166 4 10 | 42218 4 9
> 256K: 113124 9 20 | 86516 8 17
> 512K: 677242 57 78 | 448231 45 63
> 1M: 254195 21 100 | 355804 36 100
>
> So next I looked at the sg_tablesize and found it was being set to
> "80" by the driver (which queries the firmware). I tried to hack the
> driver and increase this value but bad things happened and so it
> looks like it is a genuine hardware limit with these cards.
>
> The overall throughput isn't exactly terrible because the RAID
> write-back cache does a reasonable job but I suspect it could be
> better, e.g.
>
> ost 3 sz 201326592K rsz 1024K obj 192 thr 192 write 1100.52 [ 231.75, 529.96] read  940.26 [ 275.70, 357.60]
> ost 3 sz 201326592K rsz 1024K obj 192 thr 384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
> ost 3 sz 201326592K rsz 1024K obj 192 thr 768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
> ost 3 sz 201326592K rsz 1024K obj 384 thr 384 write  920.87 [ 171.82, 466.77] read  901.03 [ 257.73, 372.87]
> ost 3 sz 201326592K rsz 1024K obj 384 thr 768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]
>
> All of this brings me to my main question - what internal cards have
> people here used which work well with Lustre? 3ware, Areca or other
> models of LSI?
>
> Cheers,
>
> Daire
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss