[Lustre-discuss] Metadata performance question

Andreas Dilger adilger at sun.com
Thu Nov 27 13:29:53 PST 2008


On Nov 25, 2008  16:53 -0500, Ben Evans wrote:
> I've only got a few physical clients (the most I've thrown at it is 11).
> I've tried multiple processes on them, but it only makes a moderate
> difference.

I'm assuming you also ran each process in a separate mountpoint.
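
If not, it is worth trying.  Something like the following is a rough
sketch (the MGS nid "mgsnode@o2ib", the fsname "testfs", and the
"create_test" load generator are just placeholders for whatever you are
actually using):

    # mount the same filesystem at several points on one client
    for i in 1 2 3 4; do
        mkdir -p /mnt/lustre$i
        mount -t lustre mgsnode@o2ib:/testfs /mnt/lustre$i
    done

    # run one create workload per mountpoint, in parallel
    for i in 1 2 3 4; do
        ./create_test /mnt/lustre$i/run$i &
    done
    wait

Each mountpoint is a separate client instance, so each one can have its
own modifying RPC in flight to the MDS.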

> I've tried differing numbers of thread allocations, but regardless of
> how many I use, I still see them all go idle for 3-4 seconds.  Sometimes
> there is a U-shaped dip in the graph of the number of running threads.
> The kjournald thread is only occasionally running during this interval.

How many OSTs do you have in the filesystem, and how wide is the striping?
It is possible that the MDS is running out of precreated objects and is
being blocked waiting for the OSTs to create more.  If you have several
OSTs in the filesystem (and you are not striping files over too many of
them), the MDS should always have a good supply of precreated objects to
draw from.
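
You can check what the directory you are creating files in will use via
lfs (the path below is just an example, and the option syntax varies a
bit between lfs versions):

    # show the current stripe settings for the test directory
    lfs getstripe /mnt/lustre/testdir

    # for 0-byte creates a single stripe per file is plenty
    # (on older lfs versions the stripe count is a positional argument)
    lfs setstripe -c 1 /mnt/lustre/testdir

A stripe count of 1 means each create consumes only one precreated
object, so the MDS burns through its precreate window more slowly.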

While there is a pause, would it be possible to get a stack trace
(e.g. "echo t > /proc/sysrq-trigger") to see where the threads are stuck?
You could also monitor the /proc/fs/lustre/osc/*/prealloc_* values on the
MDS to see if they are ever very close to each other (i.e. the difference
is very low), and check what /proc/fs/lustre/osc/*/create_count is.  That
shows how many objects the MDS asks each OST to create in a batch.
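
A rough way to watch both at once on the MDS (assuming the files are
named prealloc_next_id/prealloc_last_id on your release, and that
1-second sampling is fine enough to catch the stall):

    # sample the precreate window and batch size for each OST every second
    while true; do
        for osc in /proc/fs/lustre/osc/*; do
            echo "$(basename $osc):" \
                 "next=$(cat $osc/prealloc_next_id 2>/dev/null)" \
                 "last=$(cat $osc/prealloc_last_id 2>/dev/null)" \
                 "create_count=$(cat $osc/create_count 2>/dev/null)"
        done
        sleep 1
    done

If "last - next" repeatedly drops to nearly zero just before the threads
go idle, the MDS is stalling on object precreation.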

> I have one solid state drive in-house, and will most likely be receiving
> a few more to use in my testing in the next few weeks, so any insight
> you have into tuning them for maximum performance would be greatly
> appreciated.

There is a patch in bugzilla (search for "SSD") that should help performance
in some cases with SSDs.  It shouldn't make a difference for a 3-4s stall,
though.  In the meantime you could try running with the MDS filesystem on
a ramdisk, just to see whether the slowdown is related to the disk or to
something else.
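
For reference, setting up a throwaway MDT on a ramdisk looks roughly
like this (a sketch only: "testfs" is a placeholder, --reformat wipes
the device, the OSTs would need to be reformatted against this MGS as
well, and /dev/ram0 has to be big enough to hold the MDT):

    # format and mount a scratch MGS+MDT on a ramdisk
    mkfs.lustre --reformat --fsname=testfs --mgs --mdt /dev/ram0
    mkdir -p /mnt/mdt
    mount -t lustre /dev/ram0 /mnt/mdt

If the create rate jumps with the MDT on the ramdisk, the bottleneck is
the disk/journal latency; if not, it is elsewhere in the stack.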

> If I greatly increase the caching on the server side, I get very high
> performance while the cache is filling up, but once I exceed it,
> performance drops off quickly.

What do you mean by "increase the caching"?

> -----Original Message-----
> From: Andreas.Dilger at sun.com [mailto:Andreas.Dilger at sun.com] On Behalf
> Of Andreas Dilger
> Sent: Thursday, November 13, 2008 5:05 PM
> To: Ben Evans
> Cc: lustre-discuss at lists.lustre.org; Atul Vidwansa
> Subject: Re: [Lustre-discuss] Metadata performance question
> 
> On Nov 12, 2008  15:46 -0500, Ben Evans wrote:
> > My MDS has IB on the front end, and an FC array on the back end in a
> > RAID 10 configuration.
> >  
> > 
> > I have a number of clients creating 0 byte files in different
> > directories.  As the test runs, I can see the threads on the MDS
> > idling for 3-4 seconds before they all wake up, process whatever
> > they need to and go back to sleep.
> 
> What is important to know is the number of clients.  One of the current
> limitations on Lustre metadata scalability is that each client can only
> have a single _filesystem-modifying_ MDS request in flight at a time
> (e.g. create, rename, unlink, setattr).  This keeps recovery of the
> filesystem manageable when clients' asynchronous operations have to be
> replayed on the MDS.
> 
> It sounds like the number of clients is lower than the number of MDS
> service threads, and the MDS threads are just sitting idle until the
> kernel wakes them up again to handle an incoming request.
> 
> If ALL of the threads are idle for the same 3-4 seconds, it is possible
> they are waiting on the journal commit.  One change that might improve
> performance here dramatically is to use SSD devices for the MDS storage,
> which have MUCH lower latency.  If you decide to try that, give me a
> shout and I can advise you on further MDS filesystem tuning.
> 
> > I haven't been able to find a bottleneck in the system.  There's
> > plenty of memory, and changing various parameters in the /proc directory
> > on the clients and on the MDS doesn't help much at all.  There's
> > plenty of bandwidth on the IB network and on the FC.  CPU and memory
> > aren't overly taxed (~50% max).  Yet I can only get about 10k file
> > creates per second, once I've created enough files so that caching 
> 
> You could try either increasing the number of clients (if you have
> more), or as a temporary experiment try mounting the filesystem multiple
> times on each client (in different locations, e.g. /mnt/lustre1,
> /mnt/lustre2, etc.) and running your load once per mount point on the client.
> 
> If either of these improves the performance, then the bottleneck is in
> this one-modifying-RPC-per-client limitation.  If the performance does not
> increase, there is some other limitation.
> 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



