[Lustre-discuss] Lustre and disk tuning

Dan dan at nerp.net
Wed Jan 30 18:32:07 PST 2008


Sorry for the long delay!

I'm running Lustre 1.6.4.2.

I'm mounting with the default options.  When I mount with -o extents,mballoc
the mount appears to succeed but the volume then hangs.  I tried to check it
out by mounting as ldiskfs, but no luck.  I had to reboot the machine (a hard
boot at that) to get the devices back.  From the logs, it appears to mount
with mballoc by default.
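
For reference, the mounts I'm attempting look roughly like this (the device
and mount point names here are just placeholders):

    # works with the defaults:
    mount -t lustre /dev/sdb /mnt/lustre/ost1

    # hangs shortly after mounting when I add the ldiskfs options:
    mount -t lustre -o extents,mballoc /dev/sdb /mnt/lustre/ost1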

I'm not using partitions on the RAID devices.  I have two RAID controllers
in the system.  All disks on each controller are grouped into a single
RAID 6.  The first controller exports three volumes: one for the MGS/MDT
and two for OSTs.  The other exports only two OSTs.
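
So the layout is roughly:

    controller 1 (single RAID 6):  MGS/MDT volume, OST 1, OST 2
    controller 2 (single RAID 6):  OST 3, OST 4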

I attempted mounting with -o stripe=<raid_stripe>, where raid_stripe =
N * raid_chunksize, but no luck.  When mounted with the stripe option the
OSTs hang and never finish mounting.  I've tried a couple of stripe sizes.

I was a little uncertain of the stripe size calculation, so here we go...
My chunk size is 128k and there are 23 disks in the RAID 6 (one hot spare
set aside leaves 23).  Does that mean 21 data disks?  Going by your formula
I take 23 * 128k, which is 2944k.  Is this even close to what you intended?
This stripe size hangs at mount...
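
To spell out those numbers (I'm not sure whether to count all the active
disks or only the data disks):

    chunk size            = 128k
    active disks (RAID 6) = 23              (one hot spare set aside)
    data disks            = 23 - 2 = 21
    21 data disks    ->  21 * 128k = 2688k
    23 active disks  ->  23 * 128k = 2944k   <- what I tried; hangs at mount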

I've tried to test with the lustre-iokit but the tests (writes) fail on
most OSTs.  That is the problem I'm having after all... frustrating.
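
For what it's worth, the plain write test I can fall back on from a client
is just a direct-I/O dd (the path and sizes here are only examples):

    # direct I/O write from a client mount, to keep the page cache out of it
    dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=4096 oflag=direct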

Would it make sense to reconfigure the RAID controllers to use separate
groups of disks, each in its own RAID 6?  For performance, is there a
recommended maximum size or number of disks for each OST?  Lastly, is it
worthwhile to consider putting the ext3 journal on another device exported
from the RAID controller?
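
On the journal question, what I have in mind is roughly the following (the
device names are made up and I haven't tried this yet):

    # small LUN exported from the RAID controller, formatted as a
    # dedicated external ext3 journal
    mke2fs -O journal_dev -b 4096 /dev/sdj

    # point the OST's ldiskfs at that journal when formatting
    mkfs.lustre --ost --fsname=<fsname> --mgsnode=<mgs_nid> \
        --mkfsoptions="-J device=/dev/sdj" /dev/sdb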

Thank you!!

Dan




> On Jan 18, 2008  16:45 -0800, Dan wrote:
>>     I'm looking for some advice on improving disk performance and
>> understanding what Lustre is doing with it.  Right now I have a ~28 TB
>> OSS with 4 OSTs on it.  There are 4 clients using Lustre native - no
>> NFS.  If I write to the lustre volume from the clients I get odd
>> behavior.  Typically the writes have a long pause before any data
>> starts hitting the disks.  Then 2 or 3 of the clients will write
>> happily but one or two will not.  Eventually Lustre will pump out a
>> number of I/O related errors such as "slow i_mutex 165 seconds, slow
>> direct_io 32 seconds" and so on.  Next the clients that couldn't write
>> will catch up and pass the clients that could write.  At some point (5
>> minutes or so) the jobs start failing without any errors.  New jobs
>> can be started after these fail and the pattern repeats.  Write speeds
>> are low, around 22 MB/sec per client; the disks shouldn't have any
>> problem handling 4 writes at this speed!!  This did work using NFS.
>>
>>     When these disks were formatted with XFS, I/O was fast.  No problems
>> at all writing 475 MB/sec sustained per RAID controller (locally, not via
>> NFS).  No delays.  After configuring for Lustre the peak sustained write
>> (locally) is 230 MB/sec.  It will write for about 2 minutes before
>> logging about slow I/O.  This is without any clients connected.
>>
>> So far I've done the following:
>>
>> 1.  Recompiled SCSI driver for RAID controller to use 1 MB blocks (from
>> 256k).
>> 2.  Adjusted the MDS and OST thread counts
>> 3.  Tried all I/O schedulers
>> 4.  Tried all possible settings on RAID controllers for Caching and
>> read-ahead.
>> 5.  Some minor stuff I forgot about!
>>
>> Nothing makes a difference - same results under each configuration
>> except for schedulers.  When running the deadline scheduler the writes
>> fail faster and have delays around 30 seconds.  With all others the
>> delays range from 100 to 500 seconds.
>>
>> The system has 4 cores and 4 GB of memory with four 7 TB OSTs.  The
>> disks are in RAID 6 split between two controllers with 2 GB cache each.
>> One controller has the MGS/MDT on it.  When running top it indicates 2/3
>> to 3/4 of memory utilized and 25% CPU utilization normally.
>
> Are you using Lustre 1.4 or 1.6?  Are you mounting your OSTs with
> "-o extents,mballoc"?  We've had Lustre OSS nodes running in excess
> of 2 GB/s with h/w RAID controllers.
>
> Are you using partitions on your RAID device?  You shouldn't - that causes
> unaligned IO to the device and needless read-modify-write for each RAID
> stripe.
>
> Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)?  If not,
> then you should consider mounting your OSTs with "-o stripe={raid_stripe}",
> where raid_stripe = N * raid_chunksize, and N is the number of data disks
> (RAID 5 is N+1 disks, RAID 6 is N+2).
>
> You should download the lustre-iokit and use sgpdd-survey, obdfilter-survey,
> and PIOS to determine what is causing the performance bottleneck.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>




