[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6

Robin Humble robin.humble+lustre at anu.edu.au
Thu May 14 20:37:46 PDT 2009


Hi Stuart,

On Thu, May 14, 2009 at 01:08:36PM -0700, Stuart Marshall wrote:
>Each 6140 tray will be configured either as 1 or 2 RAID6 volumes.  The
>lustre manual recommends more smaller OST's over large and other docs I've
>seen seem to indicate that the optimal number of drives is ~(6+2).  For
>these 16 disk trays, the choice would be one (12+2R6) + external journal
>and/or hot spares or two (5+2R6)'s + ext. jrnl and/or hot spares.

2^n+parity (eg. 8+2 R6) is generally best with software raid, and
presumably with your 6140 too. 8+2 with a 64k/128k chunk size means
512kB/1MB per data stripe, which plays nicely with Lustre's 1MB data
transfer sizes.
presumably you have 6+2 because that fits neatly into your 16 disk
trays - these things are always a compromise :-/
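
for reference, you can tell ldiskfs about that geometry at format time
so allocations line up with the full stripe. a rough sketch for an 8+2
R6 with 128k chunks (fsname, mgs nid and device names are placeholders,
and I haven't checked how the 6140 reports its geometry):

  # stride = chunk / 4k blocksize = 128k / 4k = 32
  # stripe-width = data_disks * stride = 8 * 32 = 256
  mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp \
      --mkfsoptions="-E stride=32,stripe-width=256" /dev/md10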

>So my questions are:
>
>1.) What are the trade-offs of RAID1 external journal with no hot spare vs.
>single disk ext journal with a hot spare (spare is for R6 volume)?
>Specifically:

an external journal takes away half the seeks (the small writes to the
journal) when writing to RAID5/6, so it can double your write speeds.
it does for us with software raid. having said that, if you have a
large NVRAM cache in your hardware raid then you might not notice those
extra seeks, as they mostly go to ram and are flushed to spinning disk
much less frequently.
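
for what it's worth, the setup is just a journal_dev filesystem on the
small device plus pointing the OST at it - roughly like this (device
names are made up, and the block size has to match the main fs):

  # turn the small partition (or RAID1) into an external journal device
  mke2fs -O journal_dev -b 4096 /dev/sdc1

  # move an existing (unmounted) OST from internal to external journal
  tune2fs -O ^has_journal /dev/md10
  tune2fs -J device=/dev/sdc1 /dev/md10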

also I believe Lustre 1.8 hides the slowness of internal journals
better than 1.6. IIRC, it allows multiple outstanding writes to be in
flight (like metadata in 1.6) and holds copies of data on clients for
replay in case an OSS crashes. so with 1.8 you may not notice external
journals helping all that much.

>- If a single disk external journal is lost, can we run fsck and only lose
>the transactions that have not been committed to disk?  If so, then the loss
>of the disk hosting the external journal would not be catastrophic for the
>file system as a whole.

I think so, yes, although we run external journals on RAID1. if you
lose the journal device then you might have to tune2fs to delete the
external journal from the fs before you fsck, as fsck will go looking
for the (dead/missing) journal device and will sulk.
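
from memory the dance is something like this (device names are
placeholders - check the man pages before doing it in anger):

  # forget the dead external journal; -f is needed because tune2fs
  # can't open the missing journal device any more
  tune2fs -f -O ^has_journal /dev/md10
  e2fsck -fy /dev/md10

  # then re-add a journal, internal (-j) or a new external device
  tune2fs -j /dev/md10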

one problem we came across was that ext3/ldiskfs hard-codes the device
name of the external journal (eg. /dev/md5 or /dev/sdc1 or whatever)
into the filesystem.
that means that when you fail over OSSs it will look for /dev/whatever
on the failed-over node, and won't mount if it can't find it.
so you need non-intersecting namespaces of journal devices within an
OSS pair, so that each regular and failed-over RAID5/6 can always find
its correct journal device.
I didn't manage to get ext3/ldiskfs to be sane and use UUIDs instead of
hardcoded device names :-/
presumably you could also tune2fs to rename or delete the external
journal as part of a failover, but that's a horrible hack.
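
you can at least see which journal device a filesystem expects with
dumpe2fs, which helps when sorting out the failover naming (device
names here are just examples):

  # show the journal device number and journal UUID recorded in the
  # superblock
  dumpe2fs -h /dev/md10 | grep -i journal

  # and the journal device's own UUID, for matching them up by hand
  blkid /dev/sdc1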

>- How comfortable are RAID6 users with no hot spares? (We'll have cold
>spares handy, but prefer to get through weekends w/out service)

fairly comfy. you can do the sums and work out the likelihood of dual
failures given your drive sizes and error rates, and it's not
outrageous. that assumes no correlations between drive failures of course...
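
back-of-envelope only, and it ignores correlated failures and
unrecoverable read errors during rebuild, but something like this gives
the flavour (the MTBF and the 24h swap window are made-up numbers,
plug in your own):

  # rough odds that 2 more of the remaining 13 drives in a 12+2 die
  # before a cold spare goes in (24h window, 500000h MTBF assumed)
  awk 'BEGIN { n=13; t=24; mtbf=500000; p=t/mtbf;
               printf "P(two more failures) ~ %.2e\n", n*(n-1)/2 * p^2 }'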

>2.) The external journal only takes up ~400MB.  If we create 2 RAID6
>volumes, can we put 2 external journals on one disk or RAID1 set (suitably
>partitioned), or do we need to blow an entire disk for one external journal?

ext3/ldiskfs won't let you share one journal between multiple fs's
(although apparently it's technically possible), but as you say, you
can just make 2 small partitions and put a journal on each.
they will interfere if both fs's are writing heavily (no interference on
reads), but I'd guess (only a guess - I haven't measured it) the
penalty should still be smaller than with internal journals.
the Lustre 1.8 changes should probably help both external shared and
internal journal cases.
I believe Sun folks have some numbers about such shared scenarios that
you might be able to cajole out of them.
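
eg. something like this on the RAID1 LUN (device names and sizes are
just illustrative, and any partitioning tool will do - this is current
parted syntax):

  # two ~1GB partitions on the mirrored device, one external journal each
  parted -s /dev/sdd mklabel gpt \
      mkpart jrnl0 1MiB 1GiB \
      mkpart jrnl1 1GiB 2GiB
  mke2fs -O journal_dev -b 4096 /dev/sdd1
  mke2fs -O journal_dev -b 4096 /dev/sdd2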

>3.) In planning for "segment size" (chunk size in lustre manual) we'd have
>to go to 128kB or lower.  However, in single disk tests (SATA), it seems
>that larger is better so perhaps this argues for small RAID6 sets as
>mentioned in the manual.  Just wondering what other folks have found here
>also.

you don't want your RAID chunk size to be such that data_disks*chunk > 1MB,
as then every Lustre op will be hitting less than one full stripe on the
RAID, which causes read-modify-writes and will be slow.
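
ie. chunk <= 1MB / data_disks. for the layouts you mentioned, the
arithmetic is just the division:

  # max chunk size so one 1MB Lustre RPC covers a whole data stripe
  for d in 12 5 8; do
      awk -v d=$d 'BEGIN { printf "%2d data disks: chunk <= %dkB\n", d, 1024/d }'
  done

so 12+2 pushes you down to 64k chunks, 5+2 works with 128k, and 8+2
lines up exactly at 128k.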

>We have the opportunity to test several scenarios with 2 6140 trays that are
>not part of the 1.6.x production system so I expect we will test performance
>as a function of the number of drives in the RAID6 volume (eg. 12+2 vs 5+2)
>along with array write segment sizes via sgpdd-survey.
>
>I'll report back with test results once we sort out which knobs seem to make
>the most difference.

that would be great to know.
6140's are probably quite different from the software raid md SAS JBODs
we run here.
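
fwiw we drive sgpdd-survey roughly like this (parameter names are from
memory - check the header of the script - and the device list is
obviously yours):

  size=8192 crglo=1 crghi=16 thrlo=1 thrhi=64 \
      scsidevs="/dev/sdb /dev/sdc" ./sgpdd-survey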

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


