[Lustre-discuss] MDT overloaded when writing small files in large number

Jeff Darcy jeffd at sicortex.com
Mon Dec 8 08:21:43 PST 2008


Daire Byrne wrote:
> In the past when we've had workloads with lots of small files where each client/job has a unique dataset, we have used disk image files on top of a Lustre filesystem to store them. The image file is stored on a single OST, so it reduces the overhead of going to the MDS every time - each file access becomes a seek within the image. We've even used squashfs archives before for write-once/read-often small-file workloads, which has the added benefit of saving disk space. However, if the dataset needs to be writable by many clients simultaneously, then this isn't going to work.
>   
We've used the same trick for read-only stuff too.  In one case, using 
NBD to serve images that lived in a Lustre FS yielded a >12x improvement 
vs. using Lustre directly for an application that read ~80K (very) small 
files.  Unfortunately, there's no equivalent solution for writes, even 
if the writes are only occasional.  If people wanted to brainstorm about 
combinations of NFS, NBD, unionfs, and other tricks that could be layered 
on Lustre to good effect in many-small-file cases (which aren't as 
uncommon in HPC as you'd think), it might be a very worthwhile 
exercise.  I think many users and vendors hit this sooner or later, and 
agreeing on some recipes that could also serve as use/test cases would be 
to everyone's benefit.
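For anyone who hasn't tried the image-file trick, a minimal sketch follows. The paths, sizes, and mount points here are placeholders (not from this thread); in practice the image would sit on a real Lustre mount so that the whole dataset maps onto one OST object and one MDS entry:

```shell
#!/bin/sh
# Sketch of the disk-image trick: many small files inside one big file.
# LUSTRE is a stand-in for your actual Lustre mount point.
LUSTRE=${LUSTRE:-/tmp/lustre-demo}
IMG=$LUSTRE/job1.img
mkdir -p "$LUSTRE"

# One sparse 64 MiB image file = one MDS entry, regardless of how many
# small files later live inside it.
dd if=/dev/zero of="$IMG" bs=1M count=0 seek=64 2>/dev/null

# Format it with a local filesystem; mkfs on a plain file needs no root.
if command -v mkfs.ext4 >/dev/null 2>&1; then
    mkfs.ext4 -F -q "$IMG"
fi

# On the single client that owns the dataset, loop-mount it (root needed):
#   mount -o loop "$IMG" /mnt/job1
# For write-once/read-many data, "mksquashfs /path/to/tree job1.sqsh"
# produces a compressed read-only image instead.
ls -l "$IMG"
```

Once mounted, opens and stats inside the image are handled entirely by the local filesystem driver, never touching the MDS - which is exactly why it falls apart as soon as several clients need concurrent write access.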
