[Lustre-discuss] lustre and small files overhead

Andreas Dilger adilger at sun.com
Sun Mar 9 21:53:15 PDT 2008


On Mar 07, 2008  12:49 +0100, Joe Barjo wrote:
> I made some more tests, and have setup a micro lustre cluster on lvm
> volumes.
> node a: MDS
> node b and c: OST
> node a,b,c,d,e,f: clients
> Gigabit ethernet network.
> Made the optimizations: lnet.debug=0, lru_size to 10000, max_dirty_mb to
> 1024

For high RPC-rate operations, an interconnect like InfiniBand is better
than ethernet.
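
Assuming you applied those with the commands from my earlier mail
(quoted below), it is worth double-checking that they actually took
effect on every client.  Something like this (same /proc paths as in
the quoted mail) should show 0, 10000 and 1024 respectively:

sysctl lnet.debug
cat /proc/fs/lustre/ldlm/namespaces/*/lru_size | sort -u
cat /proc/fs/lustre/osc/*/max_dirty_mb | sort -u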

> The svn checkout takes 50s (15s on a local disk, 25s on a local lustre
> demo (with debug=0)).
> Watching gkrellm, a single svn checkout consumes about 20% of the MDS
> system CPU with about 2.4MB/s of ethernet traffic.

> About 6MB/s of disk bandwidth on OST1 and up to 12-16MB/s on OST2,
> while the network bandwidth on the OSTs is about 10 to 20 times lower
> than the disk bandwidth.
> Why is there so much disk bandwidth on the OSTs?  Is it a readahead
> problem?

That does seem strange; I can't really say why offhand.  There is some
metadata overhead, and it is higher with small files, but I wouldn't
expect it to be 10-20x.
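
If you want to rule out readahead, two quick things to look at are the
client-side readahead limit and the readahead of the block device
backing the OST.  This is only a sketch; /dev/sdX is a placeholder for
whatever device actually backs your OST:

# client readahead limit, in MB, per mounted filesystem
cat /proc/fs/lustre/llite/*/max_read_ahead_mb

# block-device readahead on the OST node, in 512-byte sectors
blockdev --getra /dev/sdX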

> I launched a compilation distributed over the 6 clients:
> MDS system CPU goes up to 60% (Athlon 64 3500+), 12MB/s on the
> ethernet, and the OSTs go up to the same level as in the previous test.
> 
> How come there is so much network communication with the MDT?

Because every metadata operation currently has to be done on the MDS.
We are working toward a metadata writeback cache on the client, but
that doesn't exist yet.  For operations like compilation the load is
basically entirely metadata overhead.
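
You can see this for yourself by counting the system calls a single
compile step makes; a large fraction of them are opens, stats and
closes, and each of those is currently a round trip to the MDS.  For
example (foo.c is just a placeholder source file):

# summarize syscall counts for one compile step, children included
strace -c -f gcc -c foo.c -o foo.o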

> As I understand it the MDS cannot be load balanced, so I don't see how
> lustre can scale to thousands of clients...

Because in many HPC environments there are very few metadata operations
in comparison to the amount of data being read/written.  Average file
sizes are 20-30MB instead of 20-30kB.
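
A rough back-of-the-envelope comparison makes the point; the 3 metadata
RPCs per file used below is only an illustrative assumption:

# metadata RPCs needed to move 1GB of data, assuming ~3 RPCs per file
echo "25MB files: $(( 3 * 1024 / 25 )) RPCs/GB"        # -> 122
echo "25kB files: $(( 3 * 1024 * 1024 / 25 )) RPCs/GB" # -> 125829

That is roughly a 1000x difference in MDS load for the same amount of
data moved.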

> It looks like lustre is not made for this kind of application

No, it definitely isn't tuned for small files.

> Best regards.
> Andreas Dilger wrote:
> >
> > Have you turned off debugging (sysctl -w lnet.debug=0)?
> > Have you increased the DLM lock LRU sizes?
> >
> > for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
> >     echo 10000 > $L
> > done
> >
> > In 1.6.5/1.8.0 it will be possible to use a new command to set
> > this kind of parameter more easily:
> >
> > lctl set_param ldlm.namespaces.*.lru_size=10000
> >
> >   
> >> Is there something to tweak to lower this overhead?
> >> Is there a specific tweak for small files?
> >>     
> >
> > Not really, this isn't Lustre's strongest point.
> >
> >   
> >> Using multiple server nodes, will the performance be better?
> >>     
> >
> > Partly.  There can only be a single MDT per filesystem, but it can
> > scale quite well with multiple clients.  There can be many OSTs,
> > but it isn't clear whether you are IO bound.  It probably wouldn't
> > hurt to have a few to give you a high IOPS rate.
> >
> > Note that increasing the OST count also allows clients to cache more
> > dirty data by default (32MB per OST).  You can change this manually;
> > the default is tuned for very large clusters (1000's of nodes).
> >
> > for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
> >     echo 256 > $C
> > done
> >
> > Similarly, in 1.6.5/1.8.0 it will be possible to do:
> >
> > lctl set_param osc.*.max_dirty_mb=256
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >
> >
> >   
> 



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



