[Lustre-discuss] lustre and small files overhead

Andreas Dilger adilger at sun.com
Fri Feb 29 11:46:02 PST 2008


On Feb 29, 2008  15:37 +0100, Joe Barjo wrote:
> We have a (small) 30-node SGE-based cluster running CentOS 4, which
> will grow to a maximum of 50 Core Duo machines.
> We use custom gmake-based software to launch parallel compilation and
> computations, with lots of small files and some large files.
> We currently use NFS and have a lot of coherency problems between
> nodes.
> 
> I'm currently evaluating lustre and have some questions about lustre
> overhead with small files.
> I successfully installed the rpms on a test machine and launched the
> local llmount.sh script.

Note that if you are using the unmodified llmount.sh script, this runs
on loopback files in /tmp, so the performance is likely quite bad.  For
a realistic performance measure, put the MDT and OST on separate disks.
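
As a minimal sketch of a two-disk setup (the fsname, device names,
and server NID below are placeholders; adjust them for your site):

# on the server: format a combined MGS/MDT and an OST on separate disks
mkfs.lustre --fsname=testfs --mgs --mdt /dev/sdb
mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 /dev/sdc
mount -t lustre /dev/sdb /mnt/mdt
mount -t lustre /dev/sdc /mnt/ost0

# on a client
mount -t lustre mds1@tcp0:/testfs /mnt/testfs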

> The first thing I tried was an svn checkout into it (lots of small
> files...).
> It takes 1m54s from our local svn server, versus 15s to a local ext3
> filesystem and 50s over NFS.
> During the checkout, the processor (amd64 3200) is 90% busy in system
> time.
> 
> How come there is so much system time?

Have you turned off debugging (sysctl -w lnet.debug=0)?
Have you increased the DLM lock LRU sizes?

# allow each DLM namespace to cache up to 10000 locks on the client
for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
    echo 10000 > $L
done
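
A small-file workload touches a lot of locks, and a larger LRU lets
the client keep them cached instead of dropping and re-acquiring them
from the servers on every pass over the tree.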

In 1.6.5/1.8.0 it will be possible to use a new command to set this
kind of parameter more easily:

lctl set_param ldlm.namespaces.*.lru_size=10000
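
The same releases will also have a matching read-back command, which
is handy for confirming that the setting took effect:

lctl get_param ldlm.namespaces.*.lru_size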

> Is there something to tweak to lower this overhead?
> Is there a specific tweak for small files?

Not really, this isn't Lustre's strongest point.

> Using multiple server nodes, will the performance be better?

Partly.  There can only be a single MDT per filesystem, but it can
scale quite well with multiple clients.  There can be many OSTs,
but it isn't clear whether you are IO bound.  It probably wouldn't
hurt to have a few to give you a high IOPS rate.
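
If you want to check whether you are IO bound before adding disks,
watching the servers with the standard sysstat tools while a build
runs should tell you (nothing Lustre-specific here):

# consistently high %util/await on the OST disks means IO bound
iostat -x 5
# a large "sy" column here matches the system-CPU symptom you saw
vmstat 5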

Note that increasing the OST count also allows clients to cache more
dirty data by default (32MB per OST).  You can change this manually;
the default is tuned for very large clusters (thousands of nodes).

# raise the per-OSC dirty-cache limit from the 32MB default
for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
    echo 256 > $C
done
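
As a rough example, with 8 OSTs a client may cache 8 x 32MB = 256MB of
dirty data at the defaults, and 8 x 256MB = 2GB after the change above,
so size max_dirty_mb against the RAM you can spare on each client.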

Similarly, in 1.6.5/1.8.0 it will be possible to do:

lctl set_param osc.*.max_dirty_mb=256

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



