[Lustre-discuss] How to track down a latency/timing problem

Thu Aug 12 09:10:53 PDT 2010

Hello Lustre Experts,

I am trying to solve a problem with very slow "ls" and other operations
that touch large numbers of files, despite good overall read/write rates.

We are running a small cluster of 3 OSSs with 9 OSTs, 1 MDS (with an SSD
MDT) and currently two clients. All server nodes run CentOS 5.2 with
Lustre 1.8.1, while the clients run CentOS 5.4 with Lustre 1.8.3. All
components are networked with DDR IB. The stripe count is set to 1 or 2,
depending on the directory.

From the very beginning of our tests we have had rather slow metadata
operations. File creation maxes out at 250 files/s per client. An "ls" of
a directory with 1000 files takes about 40-70 seconds, almost
independently of the file sizes. Directories that have recently been
accessed are, of course, much faster due to caching. There is no general
performance problem, as we are getting almost 1G/s when reading/writing
big files from two clients in several threads. But when creating lots of
files with lmdd in a single-threaded test script, there are also hangs
of a few seconds before the rates get back to normal.
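
The creation hangs described above are easier to pin down when each
create is timed individually instead of only looking at the average
rate. The following is a minimal sketch of such a measurement, not the
lmdd test script itself. It assumes Python 3 on the client; the function
names and the 0.5 s stall threshold are hypothetical choices made for
illustration.

```python
import os
import time


def create_files(directory, count):
    """Create `count` empty files; return the latency of each create in seconds."""
    latencies = []
    for i in range(count):
        path = os.path.join(directory, "f%06d" % i)
        start = time.perf_counter()
        # O_EXCL makes the call fail instead of silently reusing an old file.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o644)
        os.close(fd)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, stall_threshold=0.5):
    """Report the overall rate and the indices of creates slower than the threshold."""
    total = sum(latencies)
    return {
        "files": len(latencies),
        "rate_per_s": len(latencies) / total if total > 0 else float("inf"),
        "max_s": max(latencies, default=0.0),
        "stalls": [i for i, t in enumerate(latencies) if t >= stall_threshold],
    }
```

Calling summarize(create_files("/path/on/lustre", 10000)) on a client
shows whether the slow rate is spread evenly over all creates or
concentrated in a few multi-second stalls; the path is a placeholder.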

I've been searching the mailing list archives for similar problems but
only found the usual "improving performance for small files" hints. All
of these suggestions have been tested, but they either did not improve
performance or improved it only slightly. Can anyone please tell me ...

- Is there a way to check the amount of time that the different parts of
a file operation take (like 1ms requesting metadata, 1ms receiving
metadata, 123ms reading blocks from OST, ...)?
- Does anyone have a hint on what the problem could be, where to look,
or what to do?
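
As a rough illustration of the kind of breakdown the first question asks
for, the sketch below splits a directory listing into two phases on the
client: enumerating the names, and fetching the attributes of each
entry, which is what "ls -l" (or a colour-aliased "ls") does for every
file. It assumes Python 3; the function name is hypothetical. It only
shows where the wall-clock time goes as seen from the client and says
nothing about how that time is divided between the MDS and the OSSs.

```python
import os
import time


def time_listing(directory):
    """Time name enumeration and per-entry stat calls separately."""
    t0 = time.perf_counter()
    names = os.listdir(directory)  # names only, no attributes
    t1 = time.perf_counter()
    for name in names:
        # lstat fetches size, mode and timestamps without following symlinks
        os.lstat(os.path.join(directory, name))
    t2 = time.perf_counter()
    return {
        "entries": len(names),
        "readdir_s": t1 - t0,
        "stat_s": t2 - t1,
    }
```

If nearly all of the 40-70 seconds shows up in "stat_s" rather than
"readdir_s", the per-file attribute lookups are the slow part rather
than the directory read itself.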

Any help is much appreciated!

Thanks!

Robert