[Lustre-discuss] MDT overloaded when writing small files in large number

Daire Byrne Daire.Byrne at framestore.com
Thu Dec 11 02:13:03 PST 2008


Andreas,

----- "Andreas Dilger" <adilger at sun.com> wrote:

> As an FYI, Lustre supports the "immutable" attribute on files (set
> via "chattr +i {filename}", so that "read-only" files can prevent clients
> from modifying the file accidentally.  It requires root to set and
> clear this flag, but it will prevent even root from modifying or
> deleting the file until it is cleared.

Good to know - thanks.
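For anyone else following along, I assume the usage is roughly this (the path here is made up):

 # set the immutable flag (must be root); the file can no longer be modified or deleted
 chattr +i /mnt/lustre/projects/some_asset.obj
 # check the flag is there
 lsattr /mnt/lustre/projects/some_asset.obj
 # clear it again (root only) when the file really does need to change
 chattr -i /mnt/lustre/projects/some_asset.obj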

> > Another trick we used for small files was to cache them on a Lustre
> > client which then exported it over NFS. Putting plenty of RAM in the
> > NFS exporter meant that we could hold a lot of metadata and file data
> > in memory. We would then "bind" mount this over the desired branch of
> > the actual Lustre filesystem tree. This kind of defeats the purpose
> > of Lustre somewhat but can be useful for the rare cases when it can't
> > compete with NFS (like small files).
> 
> Unfortunately, metadata write-caching proxies are still down the road
> a ways for Lustre, but it is interesting to see this as a workaround.
> Do you use this in a directory where lots of files are being read, or
> also in case of lots of small file writes?  It would be highly
> strange if you could write faster into an NFS export of Lustre than to the
> native client.

We only really use this for write-once/read-many types of workloads. In fact we tend to use this trick only on user workstations, where waiting for 30,000 small files to load into Maya can be a ten-minute operation (down to around 2 minutes via an NFS cache). On the compute farm we just leave everything as native Lustre.
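In case it is useful to anyone, the setup looks roughly like the below (hostnames, paths and export options are made up for illustration; ours differ):

 # on the NFS re-exporter - an ordinary Lustre client with plenty of RAM:
 mount -t lustre mds01@tcp0:/fsname /mnt/lustre
 echo '/mnt/lustre/projects/meshCache *.ws.example.com(ro,async)' >> /etc/exports
 exportfs -ra

 # on each workstation (itself already a Lustre client):
 mount -t nfs nfscache01:/mnt/lustre/projects/meshCache /mnt/nfscache
 mount --bind /mnt/nfscache /mnt/lustre/projects/meshCache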

> Also, what version is the client in this case?  With 1.6.6 clients
> and servers the clients can grow their MDT + OST lock counts on demand,
> and the read cache limit is by default 3/4 of RAM, so one would
> expect that the native client could cache as much as needed already.  The 1.8.0
> OST will also have read cache, as you know, and it would be
> interesting to know if this improves the small-file performance to NFS levels.

The client was 1.6.5, which I thought had the dynamic lock count tuning enabled too. Either way, I would bump up the MDT lock count by hand to cover the likely number of small files. I am also interested to see how small-file performance is affected by OST caching. However, it is my experience that the slow small-file performance has more to do with waiting for both the MDS and OSS RPCs to return for every file, and that the read speed from disk is a small percentage of the overall time. Of course, once the disk starts seeking all over the place under heavy load then it has a much greater effect.
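For reference, the knob I bump by hand is along these lines (from memory, so treat the exact parameter names as approximate):

 # raise the client's MDC lock LRU so ~30,000 files' worth of locks stay cached
 lctl set_param ldlm.namespaces.*mdc*.lru_size=40000
 # or directly via /proc on each namespace:
 for f in /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size; do echo 40000 > $f; done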

> One of the things that we also discussed internally in the past is to
> allow storing of small files (<= 64kB for example) entirely in the
> MDT. This would allow all of the file attributes to be accessible in a
> single place, instead of the current requirement of doing 2+ RPCs to get all
> of the file attributes (MDS + OSS).

As I said, it *seems* to me (scientific, huh?) that much of the time between files is spent waiting on both of these RPCs to return, in the same way that "ls -l" type operations are always going to be slower on Lustre than on a single NFS server. Putting small files on the MDT would probably help, but it will make sizing up the MDT a bit trickier.
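One crude way to see this from the client side is something like the following (again from memory, and the exact stat names may vary between versions):

 # clear the client-side RPC stats, walk the tree, then see where the time went
 lctl set_param mdc.*.stats=clear osc.*.stats=clear
 ls -lR /mnt/lustre/tests/meshCache > /dev/null
 lctl get_param mdc.*.stats osc.*.stats | egrep 'getattr|req_waittime'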

> Having feedback from you for particular weaknesses makes it much more
> likely that they will be implemented in the future.  Thanks for
> keeping in touch.

Here is a quick-and-dirty comparison of the stat()-type performance over ~30,000 files using various setups:

 # find /net/epsilon/tests/meshCache -printf '%kk\t|%T@|%P\n' > /dev/null
 files on native Lustre:      41.54 seconds
 files in squashfs loopback:   1.73 seconds
 files in xfs loopback:        2.75 seconds
 files from NFS cache:         5.37 seconds
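For completeness, the kind of setup behind the loopback numbers is roughly the following (image size and mount points are made up, and the exact invocation may differ):

 # squashfs: pack the tree into an image and loop-mount it read-only
 mksquashfs /mnt/lustre/tests/meshCache /tmp/meshCache.sqsh
 mount -o loop /tmp/meshCache.sqsh /mnt/sqsh

 # xfs: make an image file, format it, loop-mount it and copy the tree in
 dd if=/dev/zero of=/tmp/meshCache.img bs=1M count=4096
 mkfs.xfs -f /tmp/meshCache.img
 mount -o loop /tmp/meshCache.img /mnt/xfs
 cp -a /mnt/lustre/tests/meshCache/. /mnt/xfs/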

Daire


