[Lustre-discuss] Slow Tar File Extraction

Peter Grandi pg_lus at lus.for.sabi.co.UK
Mon Mar 28 06:13:12 PDT 2011


[ ... ]

>> I've duplicated the issue using their tar file. It's not
>> compressed, is 126MB in size and contains 9000 files in a
>> single subdirectory, most of the files in the tar file are
>> 20K or less in size.

Many small files in a single directory are bad news for any
filesystem, especially one that locks the directory (as it
should), does synchronous metadata updates (as it should) and
commits files on close (as it should).
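One quick way to see the per-file cost is to count the syscalls
'tar' makes during the extraction, for example (the archive name
and target directory below are just placeholders):

  # summary of syscall counts and time spent per syscall
  strace -f -c tar xf smallfiles.tar -C /lustre/scratch/testdir

On a filesystem with high metadata latency most of the time
typically shows up in the open/close and utime/chmod calls
rather than in write.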

It still amazes me that people with 9000 small records store
them as individual files instead of, say, a ZIP or 'ar' archive
(there is a reason '.a' files are used to hold many small '.o'
files), or better still a BDB or similar database file, but this
is very common.
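For example (file names are hypothetical), the records can stay
in one archive and individual members be read on demand:

  # pack the small files into a single archive
  zip -q records.zip record-*.dat
  # read one member without unpacking the rest
  unzip -p records.zip record-0042.dat | md5sum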

Especially for Lustre which was designed for large highly
parallel data streaming, not for small sequential metadata
workloads.

>> Here are the results:
>> * Extraction
>>   * NFS home: 10 seconds
>>   * Lustre scratch: 20 minutes
>> * md5sum for all 9000 files
>>   * NFS home: 7 seconds
>>   * Lustre scratch: 6 minutes

The NFS numbers are very low indeed. For writing that's about
12MB/s and 900 inodes/s. Unless the clients are on 100Mb/s that's
several times slower than expected. As to reading from NFS,
18MB/s and about 1300 inodes/s are again quite a bit slower than
the roughly 90MB/s expected on a 1Gb/s link (and the link must
be faster than 100Mb/s, since the reported read rate already
exceeds the 12MB/s or so that 100Mb/s can deliver). Surely NFS,
like any network filesystem, has performance issues in the
many-small-files case, but it should not be that bad.
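Roughly, from the 126MB archive and 9000 files:

  126MB / 10s = 12.6MB/s, 9000 files / 10s = 900 inodes/s  (extract)
  126MB /  7s = 18MB/s,   9000 files /  7s ~ 1290 inodes/s (md5sum)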

I have seen similarly bad (or worse) numbers from a major
science facility where the servers were rather mis-set-up, but
if that is not your case then check the other possible reasons,
such as a misconfigured client, or a network that is overloaded,
lossy or misconfigured, as these could be affecting the Lustre
case too.
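A quick way to rule out a lossy link is to look at the interface
error and drop counters on both the clients and the servers (the
interface name below is just an example):

  ip -s link show eth0                   # RX/TX errors and drops
  ethtool -S eth0 | grep -iE 'err|drop'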

> Sorry, I typo'd the "setstripe 1" line, it should have read,
> "setstripe 1". The extraction took 3 minutes. The directory had
> a stripe of 2. [ ... ]

The numbers for Lustre are excessively bad: striped it does
100KB/s and 8 inodes/s writing, and 350KB/s and 25 inodes/s
reading, and unstriped it does 700KB/s and 50 inodes/s writing.
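Again roughly:

  126MB/1200s ~ 105KB/s, 9000/1200s ~ 7.5 inodes/s (striped write)
  126MB/360s  = 350KB/s, 9000/360s  = 25 inodes/s  (md5sum read)
  126MB/180s  = 700KB/s, 9000/180s  = 50 inodes/s  (unstriped write)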

The large difference in inode rates between the striped and
unstriped cases in particular suggests that the metadata updates
are done synchronously, with several high-latency exchanges
between the MDS and the OSS or OSSes involved. This points to
poor network and/or disk (most likely disk) latency, especially
on the MDS, often due to extreme mis-setup of the network or the
MDS, or both.

IIRC there is a vast difference between most versions of Lustre
and NFS as to write buffering on the client, and 'stat' is
especially expensive on Lustre because it is a synchronous
multi-node operation.
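One simple way to see that 'stat' cost from a client (the
directory name is just an example): a plain listing only needs
the MDS, while a long listing also has to ask the OSTs for file
sizes, so the gap between the two times is a rough measure of
the extra per-file round trips.

  cd /lustre/scratch/testdir
  time ls > /dev/null      # names only: MDS lookups
  time ls -l > /dev/null   # stat each file: MDS plus OST queries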

Do the usual basic checks just to establish a baseline (a rough
command sketch follows the list):

  * Bandwidth test between client and MDS, MDS and OSS, client
    and OSS using something like 'nuttcp'.
  * Copy the '.tar' file to an OST on the OSS itself (not over
    the network) using 'dd bs=1M oflag=direct'.
  * Copy the '.tar' file to Lustre from a client using 'dd bs=1M
    conv=fsync'.
  * Create 9000 empty files in a newly created directory in the
    MDT on the MDS itself.
  * Create 9000 empty files in a newly created directory in one
    OST on the OSS itself.
  * Create 9000 empty files in a newly created directory from a
    Lustre client.
  * On the MDS check the IO rates with 'iostat -xd 1'.
  * On the OSS check the IO rates with 'iostat -xd 1'.
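A minimal sketch of those checks, assuming hypothetical mount
points '/mnt/ost0' and '/lustre/scratch' and a hypothetical
server name; adjust everything to your layout:

  # bandwidth: run 'nuttcp -S' on the server end, then from the
  # other node:
  nuttcp mds-server

  # raw write onto an OST filesystem on the OSS itself,
  # bypassing the network and the page cache
  dd if=test.tar of=/mnt/ost0/test.tar bs=1M oflag=direct

  # write to Lustre from a client, forcing a final flush
  dd if=test.tar of=/lustre/scratch/test.tar bs=1M conv=fsync

  # 9000 empty-file creates; repeat on the MDT, on an OST, and
  # on a client mount of the filesystem
  mkdir testdir && cd testdir
  time sh -c 'for i in $(seq 1 9000); do : > "f$i"; done'

  # per-device IO rates on the MDS and the OSS while the above run
  iostat -xd 1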

I guess that you will have some not-so-amusing surprises.


