[Lustre-discuss] Lustre and disk tuning

Thu Jan 31 11:38:55 PST 2008

On Jan 31, 2008  08:25 -0800, Dan wrote:
> Thanks Andreas.  I'll reconfigure the RAID and give it another shot
> today.  Would it be reasonable to attribute the stalled writes to this
> I/O mismatch I have?

It would definitely hurt performance.  Also, placing the MDT on the
same RAID 6 is not very desirable.  Given that you now have a few
spare disks on the system, I'd also recommend a separate RAID 0+1 for
the MDT device.
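
As a rough sanity check of the stripe arithmetic quoted below, here is a
minimal Python sketch.  The function and variable names are illustrative
assumptions, not part of Lustre or any RAID tool; the only inputs taken
from this thread are the 1MB RPC size, the chunk sizes, and the disk
counts.  It multiplies the number of data disks (parity and spares
excluded) by the chunk size and compares the result with the RPC size.

```python
# Sketch only: names below are illustrative, not part of Lustre or any RAID tool.
RPC_SIZE_KB = 1024  # 1MB RPC size, as stated in the quoted thread


def full_stripe_kb(data_disks, chunk_kb):
    """Full RAID stripe size in kB: data disks (parity excluded) times chunk size."""
    return data_disks * chunk_kb


def fits_rpc(data_disks, chunk_kb, rpc_kb=RPC_SIZE_KB):
    """True if the full stripe is no larger than one RPC.

    Per the quoted thread, a stripe larger than the RPC size always
    results in a read-modify-write of the RAID stripe.
    """
    return full_stripe_kb(data_disks, chunk_kb) <= rpc_kb


# (description, data disks, chunk size in kB), all taken from the thread
layouts = [
    ("RAID 6 21+2, 128kB chunk", 21, 128),
    ("RAID 6 8+2, 128kB chunk", 8, 128),
    ("RAID 6 16+2, 64kB chunk", 16, 64),
]

for name, data_disks, chunk_kb in layouts:
    size = full_stripe_kb(data_disks, chunk_kb)
    verdict = "fits in one RPC" if fits_rpc(data_disks, chunk_kb) else "larger than one RPC"
    print(f"{name}: {size} kB full stripe, {verdict}")
```

Note that only the 21 data disks count for the 23-disk RAID 6 set, so
its full stripe is 21 * 128kB = 2688kB rather than 23 * 128kB.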

> On Thu, 2008-01-31 at 01:40 -0700, Andreas Dilger wrote:
> > On Jan 30, 2008  18:32 -0800, Dan wrote:
> > > I was a little uncertain of the stripe size calculation so here we go...
> > > My chunk size is 128k and there are 23 disks in RAID 6 (one hot spare
> > > leaves 23).  That means 21 data disks?  Judging by your formula I take 23 *
> > > 128k which is 2944.  Is this even close to what you intended?  This stripe
> > > size hangs at mount...
> > 
> > Hmm, I don't think the mballoc code can efficiently deal with a stripe size 
> > larger than the RPC size (which is 1MB) because this will always result in
> > a read-modify-write of the RAID stripe as not enough data can be collected
> > to fill a stripe.
> > 
> > > I've tried to test with the lustre-io kit but the tests (writes) fail on
> > > most OSTs.  That is the problem I'm having after all... frustrating.
> > > 
> > > Would it make sense to reconfigure the RAID controllers to have separate
> > > groups of disks in RAID 6?  For performance is there a recommended max
> > > size or number of disks for each OST?  Lastly, is it worthwhile to
> > > consider putting the ext3 journal on another device exported from the RAID
> > > controller?
> > 
> > Having 21 disks in the RAID set is probably too large to be practical
> > because of the high overhead of doing IO of such a large size.
> > Good configurations for such a system might be 2x 8+2 + spare = 21 disks
> > with 128kB chunk size, or 16+2 + spare = 19 disks with 64kB chunk size.
> > Both result in a 1MB full stripe size, which is what mballoc and Lustre
> > are optimized for by default.
> > 
> > > > On Jan 18, 2008  16:45 -0800, Dan wrote:
> > > >>     I'm looking for some advice on improving disk performance and
> > > >> understanding what Lustre is doing with it.  Right now I have a ~28 TB
> > > >> OSS with 4 OSTs on it.  There are 4 clients using Lustre native - no
> > > >> NFS.  If I write to the lustre volume from the clients I get odd
> > > >> behavior.  Typically the writes have a long pause before any data
> > > >> starts hitting the disks.  Then 2 or 3 of the clients will write
> > > >> happily but one or two will not.  Eventually Lustre will pump out a
> > > >> number of I/O related errors such as "slow i_mutex 165 seconds, slow
> > > >> direct_io 32 seconds" and so on.  Next the clients that couldn't write
> > > >> will catch up and pass the clients that could write.  At some point (5
> > > >> minutes or so) the jobs start failing without any errors.  New jobs
> > > >> can be started after these fail and the pattern repeats.  Write speeds
> > > >> are low, around 22 MB/sec per client, the disks shouldn't have any
> > > >> problem handling 4 writes at this speed!!  This did work using NFS.
> > > >>
> > > > >>     When these disks were formatted with XFS, I/O was fast.  No problems
> > > > >> at all writing 475 MB/sec sustained per RAID controller (locally, not via
> > > >> NFS).  No delays.  After configuring for Lustre the peak sustained
> > > >> write (locally) is 230 MB/sec.  It will write for about 2 minutes
> > > >> before logging about slow I/O.  This is without any clients connected.
> > > >>
> > > >> So far I've done the following:
> > > >>
> > > >> 1.  Recompiled SCSI driver for RAID controller to use 1 MB blocks (from
> > > >> 256k).
> > > >> 2.  Adjusted MDS, OST threads
> > > >> 3.  Tried all I/O schedulers
> > > >> 4.  Tried all possible settings on RAID controllers for Caching and
> > > >> read-ahead.
> > > >> 5.  Some minor stuff I forgot about!
> > > >>
> > > > >> Nothing makes a difference - same results under each configuration
> > > > >> except for schedulers.  When running the deadline scheduler the writes fail
> > > >> faster and have delays around 30 seconds.  With all others the delays
> > > >> range from 100 to 500 seconds.
> > > >>
> > > > >> The system has 4 cores and 4 GB of memory with four 7 TB OSTs.  The disks
> > > > >> are in RAID 6 split between two controllers with 2 GB cache each.  One
> > > >> controller has the MGS/MDT on it.  When running top it indicates 2/3 to
> > > >> 3/4 of memory utilized and 25% CPU utilization normally.
> > > >
> > > > > Are you using Lustre 1.4 or 1.6?  Are you mounting your OSTs with
> > > > > "-o extents,mballoc"?  We've had Lustre OSS nodes running in excess
> > > > > of 2GB/s with h/w RAID controllers.
> > > >
> > > > Are you using partitions on your RAID device?  You shouldn't - that causes
> > > > unaligned IO to the device and needless read-modify-write for each RAID
> > > > stripe.
> > > >
> > > > > Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)?  If not,
> > > > > then you should consider mounting your OSTs with
> > > > > "-o stripe={raid_stripe}", where raid_stripe=N*raid_chunksize, N is the
> > > > > number of data disks for RAID 5 N+1 or RAID 6 N+2.
> > > >
> > > > > You should download the lustre-iokit and use sgpdd-survey,
> > > > > obdfilter-survey, and PIOS to determine what is causing the
> > > > > performance bottleneck.
> > > >
> > > > Cheers, Andreas
> > > > --
> > > > Andreas Dilger
> > > > Sr. Staff Engineer, Lustre Group
> > > > Sun Microsystems of Canada, Inc.
> > > >
> > 

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.