[Lustre-discuss] MDT overloaded when writing small files in large numbers

Andreas Dilger adilger at sun.com
Wed Dec 10 09:32:26 PST 2008


On Dec 08, 2008  15:23 +0000, Daire Byrne wrote:
> In the past when we've had workloads with lots of small files where
> each client/job has a unique dataset we have used disk image files on
> top of a Lustre filesystem to store them. This file is then stored on a
> single OST and so it reduces the overhead of going to the MDS every time
> - it becomes a file seek operation. We've even used squashfs archives
> before for write once read often small file workloads which has the
> added benefit of saving on disk space. However, if the dataset needs
> write access to many clients simultaneously then this isn't going to work.

Daire, this seems like a very useful trick - thanks for sharing.  One
could even think of this as delegating a whole sub-tree to the OST,
though of course, given the nature of such local filesystems, the image
cannot be used by more than one client at a time if it is being changed.
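
For anyone who wants to try this, something along the lines of the
following should work (the paths and sizes are made up for illustration,
and the squashfs variant of course needs squashfs support on the client):

  # put the image on a single OST, then build a local filesystem in it
  lfs setstripe -c 1 /mnt/lustre/images/job01.img
  dd if=/dev/zero of=/mnt/lustre/images/job01.img bs=1M seek=10239 count=1
  mke2fs -j -F /mnt/lustre/images/job01.img
  mkdir -p /mnt/job01
  mount -o loop /mnt/lustre/images/job01.img /mnt/job01

  # write-once/read-often variant with squashfs, as Daire describes
  mksquashfs /scratch/smallfiles /mnt/lustre/images/smallfiles.sqsh
  mount -t squashfs -o loop,ro /mnt/lustre/images/smallfiles.sqsh /mnt/smallfiles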

As an FYI, Lustre supports the "immutable" attribute on files (set via
"chattr +i {filename}"), so that "read-only" files are protected against
clients modifying them accidentally.  It requires root to set and
clear this flag, and it will prevent even root from modifying or
deleting the file until it is cleared.
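
For example (the path here is just an illustration):

  chattr +i /mnt/lustre/images/smallfiles.sqsh   # set while running as root
  lsattr /mnt/lustre/images/smallfiles.sqsh      # the 'i' flag should now show
  chattr -i /mnt/lustre/images/smallfiles.sqsh   # clear before modifying/deleting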

> Another trick we used for small files was to cache them on a Lustre
> client which then exported it over NFS. Putting plenty of RAM in the
> NFS exporter meant that we could hold a lot of metadata and file data
> in memory. We would then "bind" mount this over the desired branch of
> the actual Lustre filesystem tree. This kind of defeats the purpose
> of Lustre somewhat but can be useful for the rare cases when it can't
> compete with NFS (like small files).

Unfortunately, metadata write-caching proxies are still down the road
a ways for Lustre, but it is interesting to see this as a workaround.
Do you use this in a directory where lots of files are being read, or
also in cases with lots of small-file writes?  It would be highly strange
if you could write faster into an NFS export of Lustre than to the
native client.
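
Just so I understand the setup, I imagine it is roughly the following,
with the NFS box being an ordinary Lustre client with lots of RAM (the
hostname and paths below are invented for illustration):

  # on the caching Lustre client, export one branch of the filesystem
  # (fsid= is needed because the export is not backed by a block device)
  echo '/mnt/lustre/tools *(ro,no_root_squash,fsid=1)' >> /etc/exports
  exportfs -ra

  # on the other clients, mount the NFS view and bind it over the
  # matching branch of the native Lustre tree
  mount -t nfs cachebox:/mnt/lustre/tools /mnt/nfs-tools
  mount --bind /mnt/nfs-tools /mnt/lustre/tools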

Also, what version is the client in this case?  With 1.6.6 clients and
servers the clients can grow their MDT + OST lock counts on demand,
and the read cache limit is by default 3/4 of RAM, so one would expect
that the native client could already cache as much as needed.  The 1.8.0
OST will also have a read cache, as you know, and it would be interesting
to know if this improves the small-file performance to NFS levels.
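
For reference, the client-side knobs I mean can be checked with lctl on
a 1.6.x client (the values below are only examples):

  lctl get_param llite.*.max_cached_mb          # client data cache limit (3/4 of RAM by default)
  lctl get_param ldlm.namespaces.*.lru_size     # per-namespace DLM lock LRU size
  lctl set_param ldlm.namespaces.*.lru_size=0   # 0 lets the lock LRU grow/shrink on demand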



One of the things that we also discussed internally in the past is to
allow storing of small files (<= 64kB for example) entirely in the MDT.
This would allow all of the file attributes to be accessible in a single
place, instead of the current requirement of doing 2+ RPCs to get all
of the file attributes (MDS + OSS).

It might even be possible to do whole-file readahead from the MDS -
when the file is opened for read the MDS returns the full file contents
along with the attributes and a lock to the client, avoiding any other
RPCs.


Having feedback from you about particular weaknesses makes it much more
likely that they will be addressed in the future.  Thanks for keeping
in touch.

> ----- "siva murugan" <siva.murugan at gmail.com> wrote:
> > We are trying to adopt Lustre in one of our heavily read/write
> > intensive infrastructures (daily writes - 8 million files, 1TB).
> > The average size of the files written is 1KB (I know Lustre doesn't
> > scale well for small files, but we just wanted to analyze the
> > possibility of adopting it).
> > 
> > Following are some of the tests conducted to see the difference
> > between writing large and small files:
> > 
> > MDT - 1
> > OSTs - 13 (also act as NFS servers)
> > Clients access the Lustre filesystem via NFS (not patchless clients)
> > 
> > Test 1:
> > 
> > Number of clients - 10
> > Dataset size read/written - 971M (per client)
> > Number of files in the dataset - 14000
> > Total data written - 10GB
> > 
> > Time taken - 1390s
> > 
> > Test 2:
> > 
> > Number of clients - 10
> > Dataset size read/written - 1001M (per client)
> > Number of files in the dataset - 4
> > Total data written - 10GB
> > 
> > Time taken - 215s
> > 
> > 
> > Test 3:
> > 
> > Number of clients - 10
> > Dataset size read/written - 53MB (per client)
> > Number of files in the dataset - 14000
> > Total data written - 530MB
> > Time taken - 1027s
> > 
> > The MDT was heavily loaded during Test 3 (load average > 25).  Since
> > the file size in Test 3 is small (1KB) and the number of files written
> > is very large (14000 x 10 clients), the MDT obviously gets loaded
> > allocating inodes, even though the total data written in Test 3 is
> > only 530MB.
> > 
> > Question: Is there any parameter I can tune on the MDT to increase
> > the performance when writing a large number of small files?
> > 
> > Please help
> > 

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
