[Lustre-discuss] [HPDD-discuss] Disk array setup opinions

Michael Shuey shuey at purdue.edu
Sat Mar 9 07:43:10 PST 2013


If you can, I'd advocate the route you suggest - multiple RAID groups,
each group maps to a unique LUN, and each LUN is an OST.  Note that
you'll likely want the number of data disks in each RAID group to be a
power of 2 (e.g., 6- or 10-disk RAID6, 5- or 9-disk RAID5), so that a
full stripe divides evenly into Lustre's 1 MiB I/O and large writes
avoid a parity read-modify-write.  Obviously, more RAID groups means
more spindles spent on overhead (RAID parity), but performance is more
predictable.
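
To make the arithmetic concrete: a 10-disk RAID6 is 8 data + 2 parity
disks, so with a 128 KiB segment size per disk a full stripe is
8 * 128 KiB = 1 MiB.  If you format the OSTs yourself you can also
tell ldiskfs about that geometry - this is just a sketch, and the
device, fsname, MGS NID, and exact option spellings are placeholders
to check against your own tools:

    # 128 KiB segment = 32 x 4 KiB blocks; stripe width = 32 * 8 data disks
    mkfs.lustre --ost --fsname=testfs --index=0 --mgsnode=mgs01@tcp \
        --mkfsoptions="-E stride=32,stripe_width=256" /dev/mapper/ost0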

The other way (single RAID group, multiple LUNs, LUN == OST) means the
OSTs' performance isn't independent - they all bottleneck on the
same RAID array.  If you have enough RAID controller bandwidth, this
can (theoretically) work, but makes hunting/fixing performance
problems more complex.  In Lustre, if a file isn't writing fast enough,
you can just stripe it over more OSTs.  However, if your OSTs aren't
really independent, that may or may not help - you'll get different
bandwidth depending on how many OSTs are sharing the same pool of
physical disks.  I'd expect two OSTs that don't share drives to write
faster than two that do, and so on.
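
For reference, widening the stripe is a one-line change per directory
or file; the stripe count and path below are just placeholders:

    # Stripe new files in this directory across 4 OSTs (default is 1)
    lfs setstripe -c 4 /mnt/lustre/scratch
    # Check the resulting layout
    lfs getstripe /mnt/lustre/scratch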

BTW, if you have multiple controllers, and the array has a notion of
controller affinity (i.e., a LUN uses one controller as "primary" and
another as "secondary" or "backup"), try to balance
your RAIDs across the two controllers in your array.  For instance,
stick even-numbered LUNs on one, odd-numbered LUNs on another.  Also,
if you're doing multi-pathing into your OSSes, make sure your
multipath drivers are aware of this arrangement, and respect it.
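
With Linux dm-multipath on the OSS, "aware of this arrangement"
usually means grouping paths by priority so I/O prefers the owning
controller.  Here's a sketch of a multipath.conf stanza for an
RDAC-style array like the MD32xx - the vendor/product strings and
values are assumptions, so check them against "multipath -ll" and
your array's documentation:

    devices {
        device {
            vendor                "DELL"
            product               "MD32xx"
            hardware_handler      "1 rdac"
            prio                  rdac
            path_checker          rdac
            path_grouping_policy  group_by_prio
            failback              immediate
            no_path_retry         30
        }
    }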

Most midrange disk trays will do multipath and cache mirroring
between controllers - but if you read the fine print, you often find
that access through the secondary controller is MUCH slower.  It's
usually implemented as a write-through to the primary, or has its
cache disabled while the primary is active, etc.  Cache mirroring at
high speed is hard, complicated, and expensive, so vendors often only
implement what's minimally necessary to do failover - even if it means
the secondary controller doesn't cache a LUN unless the primary dies.
If you have one of these (and I've no idea if Dell's MD3200 does this,
but this behavior is common enough I'd think about it), you'll want to
split LUNs evenly between controllers to maximize the cache use.
You'll also want to make sure the OSS knows which path is primary for
which LUN, so it doesn't send traffic down the wrong path (or worse,
down both - round-robin balancing is a bad idea when the paths are
asymmetric) unless there's been a hardware failure.
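
A quick sanity check, assuming dm-multipath: each LUN should show two
path groups with different priorities (the higher-priority group being
the paths to the owning controller), not one big group spreading I/O
over both.

    # List every multipath device and its path groups
    multipath -ll
    # If all paths for a LUN land in a single group, it's effectively
    # "multibus" (round-robin over both controllers); use
    # "group_by_prio" as in the stanza above so I/O stays on the
    # primary paths until they actually fail.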

BTW, if you implement a single RAID group and export multiple
LUNs, any multi-controller effects can get way more complicated - and
are highly implementation-dependent.

TL;DR - Multi-raid, RAID group == LUN == OST.  Keep OSTs as
independent as you can, and watch your controller and OSS multipath
settings (if used).

--
Mike Shuey


On Sat, Mar 9, 2013 at 10:19 AM, Jerome, Ron <Ron.Jerome at ssc-spc.gc.ca> wrote:
> I am currently having a debate about the best way to carve up Dell MD3200s to be used as OSTs in a Lustre file system, and I invite this community to weigh in...
>
> I am of the opinion that it should be set up as multiple RAID groups each having a single LUN, with each RAID group representing an OST, while my colleague feels that it should be set up as a single RAID group across the whole array with multiple LUNs, with each LUN representing an OST.
>
> Does anyone in this group have an opinion (one way or another)?
>
> Regards,
>
> Ron Jerome
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss at lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss


