[Lustre-discuss] Metadata performance question

Andreas Dilger adilger at sun.com
Thu Nov 13 14:05:24 PST 2008


On Nov 12, 2008  15:46 -0500, Ben Evans wrote:
> My MDS has ib on the front end, and a FC array on the back end in a RAID
> 10 configuration.
>  
> 
> I have a number of clients creating 0 byte files in different
> directories.  As the test runs, I can see the threads on the MDS idling
> for 3-4 seconds before they all wake up, process whatever they need to
> and go back to sleep.

What is important to know is the number of clients.  One of the current
limitations on Lustre metadata scalability is that each client can only
have a single _filesystem_modifying_ MDS request in flight at a time
(i.e. create, rename, unlink, setattr).  This is to make recovery of
the filesystem manageable in the face of client asynchronous operations
being replayed on the MDS.
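To see why this limit matters, here is a back-of-the-envelope model (my numbers are illustrative assumptions, not measurements): with one modifying RPC in flight per client, the aggregate create rate is bounded by the client count divided by the per-RPC round-trip latency.

```python
# Rough model of the 1-modifying-RPC-per-client limit.
# The latency value below is a hypothetical example, not a measurement.

def max_create_rate(num_clients, rpc_latency_s):
    """Upper bound on aggregate creates/sec when each client may have
    only one filesystem-modifying MDS RPC in flight at a time."""
    return num_clients / rpc_latency_s

# e.g. 10 clients with a 1 ms create round trip cap out at 10,000 creates/sec,
# no matter how many MDS service threads are available.
print(max_create_rate(10, 0.001))
```

Under this model, adding MDS threads beyond the client count cannot help; only more clients (or lower per-RPC latency) can.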

It sounds like the number of clients is lower than the number of MDS
service threads, so the MDS threads are just sitting idle until the
kernel wakes them up to handle an incoming request.

If ALL of the threads are idle for the same 3-4 seconds, it is possible
they are waiting on the journal commit.  One change that might improve
performance here dramatically is to use SSD devices for the MDS storage,
which have MUCH lower latency.  If you decide to try that, give me a shout
and I can advise you on further MDS filesystem tuning.

> I haven't been able to find a bottleneck in the system.  There's plenty
> of memory, changing various parameters in the /proc directory on the
> clients and on the MDS doesn't help much at all.  There's plenty of
> bandwidth on the ib network and on the FC.  CPU and memory aren't overly
> taxed (~50% max).  Yet I can only get about 10k file creates per second,
> once I've created enough files so that caching 

You could try either increasing the number of clients (if you have more),
or, as a temporary experiment, try mounting the filesystem multiple times
on each client (at different mount points, e.g. /mnt/lustre1, /mnt/lustre2,
etc.) and run your load once per mount point on the client.
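As a sketch of that experiment (the MGS NID and fsname below are placeholders; substitute your own, and adjust the file count to your test):

```shell
# Hypothetical example -- replace mgs@o2ib and testfs with your site's values.
# Mount the same filesystem at two points on one client:
mkdir -p /mnt/lustre1 /mnt/lustre2
mount -t lustre mgs@o2ib:/testfs /mnt/lustre1
mount -t lustre mgs@o2ib:/testfs /mnt/lustre2

# Run one copy of the create workload per mount point in parallel,
# e.g. with the createmany utility from lustre/tests:
for m in /mnt/lustre1 /mnt/lustre2; do
    mkdir -p $m/createdir
    ./createmany -o $m/createdir/file 100000 &
done
wait
```

Each mount point gets its own client instance, so each gets its own modifying-RPC slot; if the per-client limit is the bottleneck, the aggregate create rate should roughly scale with the number of mounts.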

If either of these improves the performance, then the bottleneck is this
1-RPC-per-client limitation.  If the performance does not increase, there
is some other limitation.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
