[Lustre-discuss] Large directory performance

Peter Grandi pg_xf2 at xf2.to.sabi.co.UK
Mon Sep 13 07:22:34 PDT 2010


> We have been struggling with our Lustre performance for some
> time now especially with large directories.

Are you assuming that Lustre has been designed for good
performance with lots of (probably tiny) files in large
directories?

> I recently did some informal benchmarking (on a live system so
> I know results are not scientifically valid) and noticed a huge
> drop in performance of reads (stat operations) past 20k files in
> a single directory.

Is a benchmark really needed to figure that out?

> I'm using bonnie++, disabling IO testing (-s 0) and just
> creating, reading, and deleting 40kb files in a single
> directory.

What do you think Bonnie++ is a benchmark of?

> [ ... ] The really interesting data point is read performance,
> which for these tests is just a stat of the file not reading
> data. Starting with the smaller directories it is relatively
> consistent at just below 2,500 files/sec, but when I jump from
> 20,000 files to 30,000 files the performance drops to around
> 100 files/sec.

Why is that surprising?
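
(For reference, the "read" phase being measured here is little
more than a loop of stat() calls over the directory entries; a
rough sketch of what such a test amounts to, with a hypothetical
test path:)

    #!/usr/bin/env python
    # Rough sketch of a stat-only "read" pass over one directory,
    # approximately what the small-file read phase measures here.
    # The test directory path is a made-up example.
    import os, time

    testdir = "/lustre/scratch/stat_test"   # hypothetical path
    names = os.listdir(testdir)

    start = time.time()
    for name in names:
        # metadata lookup only, no file data is read
        os.stat(os.path.join(testdir, name))
    elapsed = time.time() - start

    print("%d stats in %.1f s = %.0f files/sec"
          % (len(names), elapsed, len(names) / elapsed))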

> [ ... ] are in the process of trying to get our users to
> change their code. [ ... ]

But as mentioned below, it is being changed in a way that will
help, though not by much.

> Then yesterday I was browsing the Lustre Operations Manual

Did you read it before designing and setting up your system?

There are relevant bits of advice in 1.4.2.2 and 10.1.1-4, for
example (some of it objectionable, such as recommending RAID6
for data storage without, at the very least, the necessary
qualifications).

> and found section 33.8 that says Lustre is tested with
> directories as large as 10 million files in a single directory

Why would "tested" imply "works real fast in every possible,
including really stupid, setup"?

> and still get lookups at a rate of 5,000 files/sec.

What sort of "lookups" do you think they were talking about?

On what sort of storage systems do you think you get 5,000 random
metadata operations/s?

Can you explain how to get 5,000 *random* metadata lookup/s from
disks that can do 50-100 random IOP/s each?
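
Do the arithmetic yourself; a back-of-the-envelope sketch (the
per-disk IOPS figure and spindle count below are illustrative
assumptions, not measurements):

    # Back of the envelope: what does 5,000 random metadata
    # lookups/sec imply? All numbers here are illustrative.
    target_ops = 5000.0      # desired metadata lookups/sec
    iops_per_disk = 75.0     # typical random IOPS for one disk

    # Without caching you need this many spindles on the MDT:
    print("spindles needed: ~%.0f" % (target_ops / iops_per_disk))

    # Or, with a small MDT of 8 spindles, the RAM cache has to
    # absorb nearly every lookup:
    disk_ops = 8 * iops_per_disk
    print("required cache hit rate: >= %.0f%%"
          % (100 * (1.0 - disk_ops / target_ops)))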

> That leaves me wondering 2 things. How can we get 5,000
> files/sec for anything and why is our performance dropping off
> so suddenly after 20k files?

Why do you need to wonder?

Have you read about new amazing techniques like caching in
RAM/flash and scaling via RAID?

Have you read the extensive discussions of metadata and data
performance in the Lustre docs?

> Here is our setup: All IO servers are Dell PowerEdge 2950s. 2
> 8-core sockets with X5355 @ 2.66GHz and 16Gb of RAM. The data
> is on DDN S2A 9550s with 8+2 RAID configuration connected
> directly with 4Gb Fibre channel.

Why do you describe where the data is when you have so far talked
only about the metadata?

Do you have a good idea of the differences (and the different
workloads, as described in the Lustre manual) between MDS/MDTs and
OSSes/OSTs?

Also, if you have a highly parallel program that deals with what
look like millions of tiny files (which looks like an appalling
misdesign to me), why do you run it on a RAID3 (of all things)
storage system?

If you are storing the metadata for Lustre on the same storage
system as the data *and* it is a RAID3 setup, WHY WHY WHY?

Why haven't you hired Sun/Oracle consultants to design and
configure your metadata and data storage systems?

> They are running RHEL 4.5, Lustre 1.6.7.2-ddn3, kernel
> 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp

Why are you running a very old version of Lustre (and on RHEL 4.5
of all things, but that is less relevant)?

Are you running the servers in 32-bit or 64-bit mode?

> As a side note the users code is Parflow, developed at LLNL.
> The files are SILO files. We have as many as 1.4 million files
> in a single directory

Why hasn't LLNL hired consultants who understand the differences
between file systems and DBMSes to help design ParFlow?

> and we now have half a billion files that we need to deal with
> in one way or another.

To me that means that the application is appallingly written
(there are a lot of those about).

Then perhaps your setup is entirely inappropriate for most types
of workload, and even more so for metadata-intensive ones; and
maybe Lustre was designed for optimal performance on large
streaming workloads, so what looks to me like an appallingly
misdesigned application works particularly badly in your case.

> The code has already been modified to split the files on newer
> runs into multiple subdirectories, but we're still dealing
> with 10s of thousands of files in a single directory.

To me that's still appalling. There are very good reasons why
file systems and DBMSes both exist, and they are not the same.
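
(If the application cannot be rewritten around a proper database,
the least it could do is hash its output into many subdirectories
so that no single directory grows past a few thousand entries; a
minimal sketch of that idea, with made-up paths and bucket count:)

    # Minimal sketch of sharding output files into hashed
    # subdirectories so no single directory holds more than a few
    # thousand entries. Paths and bucket count are made-up examples.
    import os
    import zlib

    BASE = "/lustre/project/output"   # hypothetical output tree
    BUCKETS = 1024                    # 1.4M files -> ~1,400 each

    def sharded_path(filename):
        bucket = zlib.crc32(filename.encode()) % BUCKETS
        subdir = os.path.join(BASE, "%04d" % bucket)
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, filename)

    # e.g. sharded_path("pressure.00042.silo") returns a path in
    # one of the 1024 bucket subdirectories under BASE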

> The users have been able to run these data sets on Lustre
> systems at LLNL 3 orders of magnitude faster.

Do you think that LLNL have metadata storage and caches as weak
as yours?

Given how the application is "designed", would it suffer a
colossal performance drop at LLNL too on a suitably larger data
set?

Have you realized by now that Lustre performance is very, very
anisotropic in the space of possible setups and applications?


