[Lustre-discuss] Large directory performance

Bernd Schubert bs_lists at aakef.fastmail.fm
Mon Sep 13 10:46:40 PDT 2010


> > [ ... ] The really interesting data point is read performance,
> > which for these tests is just a stat of the file not reading
> > data. Starting with the smaller directories it is relatively
> > consistent at just below 2,500 files/sec, but when I jump from
> > 20,000 files to 30,000 files the performance drops to around
> > 100 files/sec.
> 
> Why is that surprising?

No, with dirindex 30,000 files is not that much. In fact, I could reproduce
Mike's numbers with smaller directory sizes as well. But after increasing the
LRU size I could bump it for a single node to a consistent 30,000. Now people
might wonder why this matters when there is LRU auto-resize. The simple answer:
several DDN customers, CSM among them, have run into serious issues with LRU
auto-resize enabled, and not all of those issues are resolved even in the
latest Lustre releases. However, I definitely need to work on a patch to be
able to disable/enable it on demand; so far each and every network reconnection
resets it to the default, so something like a cron script is required on the
clients to set the value one wants to have.
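
A minimal sketch of what such a client-side script could look like, assuming
the usual /proc/fs/lustre/ldlm/namespaces/*/lru_size tunables of 1.6/1.8
clients are present (the value below is only a placeholder, not a
recommendation):

#!/usr/bin/env python
# Re-apply a fixed lock LRU size on all ldlm namespaces. Meant to be run
# periodically (e.g. from cron), since reconnects reset the value.
import glob

LRU_SIZE = 1000  # placeholder -- pick a value that fits your client memory

def set_lru_size(value):
    for path in glob.glob("/proc/fs/lustre/ldlm/namespaces/*/lru_size"):
        try:
            with open(path, "w") as proc_file:
                proc_file.write(str(value))
        except IOError as err:
            print("failed to set %s: %s" % (path, err))

if __name__ == "__main__":
    set_lru_size(LRU_SIZE)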

> 
> What sort of "lookups" do you think they were talking about?
> 
> On what sort of storage systems do you think you get 5,000 random
> metadata operations/s?

Really large directories suffer from the htree dirindex implementation:
readdir() returns entries in hash order, i.e. with effectively random rather
than sequential inode numbers. And that is rather sub-optimal for cp, tar,
'ls -l', etc.
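
One commonly suggested mitigation on the application side is to read the whole
directory first, sort the entries by inode number, and only then stat() them,
so the inodes are touched in roughly ascending order instead of hash order. A
small illustrative sketch (Python 3, nothing Lustre-specific assumed):

import os
import sys

def stat_sorted_by_inode(dirpath):
    # Read all entries first, then sort by inode number so the subsequent
    # stat() calls walk the inodes in roughly ascending order.
    entries = sorted(os.scandir(dirpath), key=lambda entry: entry.inode())
    for entry in entries:
        st = os.stat(entry.path)
        print(entry.name, st.st_size)

if __name__ == "__main__":
    stat_sorted_by_inode(sys.argv[1] if len(sys.argv) > 1 else ".")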

> 
> Can you explain how to get 5,000 *random* metadata lookup/s from
> disks that can do 50-100 random IOP/s each?
> 
> > That leaves me wondering 2 things. How can we get 5,000
> > files/sec for anything and why is our performance dropping off
> > so suddenly at after 20k files?
> 
> Why do you need to wonder?

I would expect performance to drop off somewhere between 100K and 1 million
files per directory, but not already at 20,000.

> > They are running RHEL 4.5, Lustre 6.7.2-ddn3, kernel
> > 2.6.18-128.7.1.el5.ddn1.l1.6.7.2.ddn3smp
> 
> Why are you running a very old version of Lustre (and on RHEL45
> of all things, but that is less relevant)?

1.6.7.2-ddnX is still maintained, and 1.8 does not provide better metadata
performance either. Tests and new systems show that 1.8.3-ddn3.2 runs rather
stably, and vanilla 1.8.4 so far also seems to be mostly fine, so we are
starting to encourage people to update. However, from my personal point of
view, 1.8.2 was a step back in stability compared to 1.8.1.1, and it took some
time to find all the issues. Some bugs that CSM occasionally runs into are also
not yet fixed in 1.8. Introducing possible and unknown new issues is usually
not an option for production systems.

> 
> Are your running the servers in 32b or 64b mode?
> 
> > As a side note the users code is Parflow, developed at LLNL.
> > The files are SILO files. We have as many as 1.4 million files
> > in a single directory
> 
> Why hasn't LLNL hired consultants who understand the differences
> between file systems and DBMSes to help design ParFlow?

With all the knowledgeable people at LLNL, I have no idea how such an
application could ever have been written.

> > The code has already been modified to split the files on newer
> > runs until multiple subdirectories, but we're still dealing
> > with 10s of thousands of files in a single directory.
> 
> To me that's still appalling. There are very good reasons why
> file systems and DBMSes both exist, and they are not the same.
> 
> > The users have been able to run these data sets on Lustre
> > systems at LLNL 3 orders of magnitude faster.
> 
> Do you think that LLNL have metadata storage and caches as weak
> as yours?

I know for certain that LLNL was working on LRU resize and pushing it into
1.6.5. That might explain the difference. Unfortunately, as I said before, it
brought up some serious new issues that are still not solved.

I also entirely agree that the application is not well suited to a Lustre
filesystem, even if LLNL seems to have found some workarounds.

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks


