[Lustre-discuss] MDT overloaded when writing small files in large number

Daire Byrne Daire.Byrne at framestore.com
Mon Dec 8 07:23:25 PST 2008


Siva,

In the past, when we've had workloads with lots of small files where each client/job has a unique dataset, we have stored them in disk image files on top of the Lustre filesystem. The image file lives on a single OST, so accessing files inside it becomes a seek within one file rather than a round trip to the MDS for every open. For write-once, read-often small-file workloads we have even used squashfs archives, which have the added benefit of saving disk space. However, neither approach will work if the dataset needs simultaneous write access from many clients.
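A minimal sketch of the loopback-image approach (paths, sizes, and filesystem choice are hypothetical; requires root on the client):

```shell
# Create a sparse 10 GB image file on the Lustre filesystem. The image
# is a single Lustre file, so it lands on one OST and per-file metadata
# operations inside it never touch the MDS.
dd if=/dev/zero of=/mnt/lustre/job1.img bs=1 count=0 seek=10G

# Format the image with an ordinary local filesystem and loop-mount it.
mkfs.ext3 -F /mnt/lustre/job1.img
mkdir -p /mnt/job1
mount -o loop /mnt/lustre/job1.img /mnt/job1

# Variant for write-once/read-often datasets: pack the small files into
# a compressed squashfs archive and loop-mount that read-only instead.
mksquashfs /path/to/small-files /mnt/lustre/job1.squashfs
mount -t squashfs -o loop /mnt/lustre/job1.squashfs /mnt/job1
```

Only the one client that has the image mounted can write into it, which is exactly the single-writer restriction noted above.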

Another trick we used for small files was to cache them on a single Lustre client which then exported them over NFS. Putting plenty of RAM in the NFS exporter meant it could hold a lot of metadata and file data in memory. We would then bind-mount the NFS mount over the desired branch of the actual Lustre filesystem tree. This somewhat defeats the purpose of Lustre, but it can be useful in the rare cases where Lustre can't compete with NFS (such as small files).
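A rough sketch of that setup (hostnames, paths, and export options are hypothetical; requires root on both machines):

```shell
# On the well-provisioned Lustre client acting as NFS exporter:
# export the small-file branch of the Lustre tree read-only.
echo '/mnt/lustre/smallfiles *(ro,no_subtree_check)' >> /etc/exports
exportfs -ra

# On each consuming client: mount the NFS export, then bind-mount it
# over the matching branch of the Lustre tree so application paths
# stay unchanged while reads are served from the exporter's page cache.
mkdir -p /mnt/nfscache
mount -t nfs exporter:/mnt/lustre/smallfiles /mnt/nfscache
mount --bind /mnt/nfscache /mnt/lustre/smallfiles
```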

Daire

----- "siva murugan" <siva.murugan at gmail.com> wrote:

> We are trying to adopt Lustre in one of our heavy read/write-
> intensive infrastructures (daily writes: 8 million files, 1 TB). The
> average size of the files written is 1 KB. (I know Lustre can't scale
> well for small files, but I just wanted to analyze the possibility of
> adopting it.)
> 
> Following are some of the tests conducted to see the difference
> between writing large and small files.
> 
> MDT - 1
> OST - 13 (also act as NFS servers)
> Clients access the Lustre filesystem via NFS (not patchless clients)
> 
> Test 1 :
> 
> Number of clients - 10
> Dataset size read/written - 971M (per client)
> Number of files in the dataset- 14000
> Total data written - 10 GB
> 
> Time taken - 1390s
> 
> Test2 :
> 
> Number of clients - 10
> Dataset size read/written -1001M (per client)
> Number of files in the dataset - 4
> Total data written - 10 GB
> 
> Time taken - 215s
> 
> 
> Test3 :
> 
> Number of clients - 10
> Dataset size read/written - 53 MB (per client)
> Number of files in the dataset- 14000
> Total data written - 530MB
> Time taken - 1027s
> 
> The MDT was heavily loaded during Test 3 (load average > 25). Since
> the file size in Test 3 is small (1 KB) and the number of files
> written is very large (14,000 x 10 clients), the MDT obviously gets
> loaded allocating inodes, even though the total data written in
> Test 3 is only 530 MB.
> 
> Question: Is there any parameter I can tune on the MDT to increase
> performance when writing a large number of small files?
> 
> Please help
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
