[Lustre-discuss] Lustre and disk tuning

Mon Jan 21 15:34:35 PST 2008

On Jan 18, 2008  16:45 -0800, Dan wrote:
>     I'm looking for some advice on improving disk performance and
> understanding what Lustre is doing with it.  Right now I have a ~28 TB
> OSS with 4 OSTs on it.  There are 4 clients using Lustre native - no
> NFS.  If I write to the lustre volume from the clients I get odd
> behavior.  Typically the writes have a long pause before any data
> starts hitting the disks.  Then 2 or 3 of the clients will write
> happily but one or two will not.  Eventually Lustre will pump out a
> number of I/O related errors such as "slow i_mutex 165 seconds, slow
> direct_io 32 seconds" and so on.  Next the clients that couldn't write
> will catch up and pass the clients that could write.  At some point (5
> minutes or so) the jobs start failing without any errors.  New jobs
> can be started after these fail and the pattern repeats.  Write speeds
> are low, around 22 MB/sec per client; the disks shouldn't have any
> problem handling 4 writes at this speed!  This did work using NFS.
> 
>     When these disks were formatted with XFS, I/O was fast.  No problems at
> all writing 475 MB/sec sustained per RAID controller (locally, not via
> NFS).  No delays.  After configuring for Lustre the peak sustained
> write (locally) is 230 MB/sec.  It will write for about 2 minutes
> before logging about slow I/O.  This is without any clients connected.
> 
> So far I've done the following:
> 
> 1.  Recompiled SCSI driver for RAID controller to use 1 MB blocks (from
> 256k).
> 2.  Adjusted MDS, OST threads
> 3.  Tried all I/O schedulers
> 4.  Tried all possible settings on RAID controllers for Caching and
> read-ahead.
> 5.  Some minor stuff I forgot about!
> 
> Nothing makes a difference - same results under each configuration except
> for schedulers.  When running the deadline scheduler the writes fail
> faster and have delays around 30 seconds.  With all others the delays
> range from 100 to 500 seconds.
> 
> The system has 4 cores and 4 GB of memory with four 7 TB OSTs.  The disks are
> in RAID 6 split between two controllers with 2 GB cache each.  One
> controller has the MGS/MDT on it.  When running top it indicates 2/3 to
> 3/4 of memory utilized and 25% CPU utilization normally.

Are you using Lustre 1.4 or 1.6?  Are you mounting your OSTs with
"-o extents,mballoc"?  We've had Lustre OSS nodes running in excess
of 2GB/s with h/w RAID controllers.

Are you using partitions on your RAID device?  You shouldn't - that causes
unaligned IO to the device and needless read-modify-write for each RAID
stripe.
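
To illustrate the alignment problem, here is a minimal sketch in Python.
The numbers are assumptions chosen for illustration only: a 4+1 array
with a 256 KiB chunk, and a first partition starting at sector 63 (the
traditional MS-DOS partition table default).  The helper name
is_full_stripe_write is hypothetical, not part of any Lustre tool.

```python
SECTOR_BYTES = 512
ONE_MIB = 1024 * 1024

def is_full_stripe_write(device_offset, length, stripe_bytes):
    """True if a write covers whole RAID stripes only, so the controller
    can compute parity without first reading old data (no read-modify-write)."""
    return device_offset % stripe_bytes == 0 and length % stripe_bytes == 0

# Assumed geometry: 4 data disks x 256 KiB chunk = 1 MiB full stripe.
stripe_bytes = 4 * 256 * 1024

# A 1 MiB write at the start of an unpartitioned device is stripe-aligned.
whole_device_ok = is_full_stripe_write(0, ONE_MIB, stripe_bytes)

# The same write inside a partition that starts at sector 63 is shifted
# by 63 * 512 = 32256 bytes, so it straddles two stripes.
partition_start = 63 * SECTOR_BYTES
partitioned_ok = is_full_stripe_write(partition_start, ONE_MIB, stripe_bytes)

print(whole_device_ok, partitioned_ok)
```

With these assumed values the first write is aligned and the second is
not, which is the case where every 1 MB write touches two partial stripes.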

Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)?  If not,
then you should consider mounting your OSTs with "-o stripe={raid_stripe}",
where raid_stripe = N * raid_chunksize and N is the number of data disks
(a RAID 5 array has N+1 disks, a RAID 6 array N+2).
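
The arithmetic above can be sketched in Python as follows.  The chunk
sizes and disk counts are made-up examples, not Dan's actual
configuration, and the function names are hypothetical.  The sketch
works in bytes; the unit the "stripe=" mount option expects is not
stated here and should be checked against the documentation for your
Lustre version.

```python
ONE_MIB = 1024 * 1024

def raid_stripe_bytes(data_disks, chunk_bytes):
    """Full stripe width: N data disks times the per-disk chunk size."""
    return data_disks * chunk_bytes

def fits_1mb_io(data_disks, chunk_bytes, io_bytes=ONE_MIB):
    """True if a 1 MB I/O is a whole number of full stripes."""
    return io_bytes % raid_stripe_bytes(data_disks, chunk_bytes) == 0

# Assumed example geometries: (label, data disks N, chunk size in bytes).
examples = [
    ("RAID 5 4+1, 256 KiB chunk", 4, 256 * 1024),  # stripe = 1048576
    ("RAID 6 8+2, 128 KiB chunk", 8, 128 * 1024),  # stripe = 1048576
    ("RAID 6 6+2,  64 KiB chunk", 6, 64 * 1024),   # stripe = 393216
]

for label, n, chunk in examples:
    stripe = raid_stripe_bytes(n, chunk)
    print(f"{label}: stripe={stripe} bytes, 1 MB I/O fits: {fits_1mb_io(n, chunk)}")
```

In this sketch the first two geometries take 1 MB I/Os as whole stripes,
while the 6+2 layout does not; that last case is the one where setting
the stripe option on the OST mount matters.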

You should download the lustre-iokit and use sgpdd-survey, obdfilter-survey,
and PIOS to determine what is causing the performance bottleneck.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.