[lustre-discuss] Improving file create performance with larger create_count

Thu Jan 7 20:56:45 PST 2021

On Jan 7, 2021, at 08:54, Nathan Dauchy - NOAA Affiliate <nathan.dauchy at noaa.gov> wrote:

Greetings Lustre Experts!

I am looking for assistance on how to improve file create rate, as measured with MDtest.

In particular, this is for filesystems with (4) MDTs that use progressive file layouts (PFL) to place the first part of each file on one of the (2) Flash OSTs, with the remainder of large files on HDD OSTs using increasing stripe count (up to 32) as the files get larger.
MDtest file create performance of this configuration is significantly lower (160k vs. 550k) than when using a simple stripe count of 1 on HDDs alone.  MDtest was run with 0-size files (not using "-w") so no data should actually be written to the OSTs, and the IOPS and CPU and Network of the Flash OSSs should be plenty to sustain higher performance.

Nathan,
one thing to pay attention to is how large your PFL layout is.  If it is 4 components and/or you have an MDT formatted with an older (pre 2.10) version of Lustre, then the amount of xattr space for the layout directly in the MDT inode is relatively small.  Limiting the PFL layout to 3 components and having an MDT inode size of 1KB should avoid the extra overhead of writing to a separate xattr block for each create.

The other potential source of overhead is if the flash OSTs have a very large number of objects vs. the HDD OSTs (e.g. hundreds of millions if there is *always* a flash component for each file and they are never removed from large files during OST migration), which makes the OST object directories very large, so that each create essentially goes into a separate directory block.  The saving grace in your case is that the flash OSTs have higher IOPS, so they _should_ be able to keep up with the HDD OSTs even if they have a higher workload.

You could try isolating whether the bottleneck is from the 2 flash OSTs to 32 HDD OSTs, or because of the large layout size.  Try creating 1-stripe files in individual directories that are each setstripe to create only on a single OST.  That should show what the single OST create rate limit is.  You could compare this against the no-stripe create rate for the MDS.
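As a rough sketch of the single-OST test described above, the helper below only prints the commands for one test directory per OST index, so they can be reviewed before being piped to a shell on a client.  The helper name, the base path, and the OST indices in the usage comment are placeholders for illustration, not values from this filesystem.

```shell
# Print (do not run) the commands that create one directory per OST index,
# each restricted to 1-stripe files on that single OST.
# Hypothetical helper: per_ost_dirs BASE_DIR OST_INDEX...
per_ost_dirs() {
  base=$1; shift
  for idx in "$@"; do
    echo "mkdir -p $base/ost$idx"
    echo "lfs setstripe -c 1 -i $idx $base/ost$idx"
  done
}

# Example (path and indices are assumptions):
#   per_ost_dirs /mnt/lustre/mdtest 0 1 | sh
```

Running mdtest against each directory in turn should then give a per-OST create rate to compare with the MDS-only rate.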

On the flip side, you could create files in a directory with a large layout that has many components, like:

  lfs setstripe -E 16M -c 1 -i 1 -E 32M -i 2 -E 48M -i 3 -E eof -i 4  $MOUNT/ost1largedir

to see how this affects the create rate.  The OST objects would be consumed 4x as fast, but from 4x OSTs (assuming they are the HDD OSTs) so they _should_ be creating/consuming OST objects at the same rate as a 1-stripe file, except the large layout size will push this to a separate xattr block and put much more IOPS onto the MDS.  You could try with the two flash OSTs, two objects on each of the two OSTs, which should get you 1/2 the create rate if the OST create rate is the limiting factor.  If this tanks performance then the issue is with the large layout xattr and not the OSTs.
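For the two-flash-OST variant, a layout of the same shape can be generated from a list of OST indices.  This is only a sketch: the helper name is made up, the 16M component step is copied from the command above, and the flash OST indices 0 and 1 in the usage comment are an assumption, so substitute the real ones.  It prints the command rather than running it.

```shell
# Build an "lfs setstripe" command with one 1-stripe component per OST index
# given, in 16M steps, with the last component running to eof.
# Hypothetical helper: pfl_cmd DIR OST_INDEX...
pfl_cmd() {
  dir=$1; shift
  cmd="lfs setstripe"
  end=16      # end offset of the current component, in MiB
  n=$#        # total number of components
  i=1
  for idx in "$@"; do
    if [ "$i" -eq "$n" ]; then
      cmd="$cmd -E eof -c 1 -i $idx"
    else
      cmd="$cmd -E ${end}M -c 1 -i $idx"
    fi
    end=$((end + 16))
    i=$((i + 1))
  done
  echo "$cmd $dir"
}

# Example: two objects on each of two flash OSTs (indices are assumptions):
#   pfl_cmd /mnt/lustre/flashlargedir 0 1 0 1
```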

Our vendor pointed out that we are likely limited by the number of "precreated objects" for the flash OSTs (since there are only two of them, handling all files) and that it can be increased as an optimization.

Sheesh, those vendors... :-)

I have found the "create_count" and "max_create_count" tunables, but the Lustre manual only references those in the context of removing or disabling an OST.  So either the manual is incomplete or I'm looking at the wrong tunable.

If create_count is indeed the right parameter to adjust...
* What is the relation between create_count and max_create_count?  And why would I never see create_count more than half of max?
MGS# lctl get_param osc.*OST0000-osc-MDT0000*.*create_count
osc.FS1-OST0000-osc-MDT0000.create_count=10000
osc.FS1-OST0000-osc-MDT0000.max_create_count=20000

The important place to check the create_count/max_create_count is on the MDS, since it is the one driving object creates to the OSTs.
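One way to scan many OSTs at once is to filter the get_param output for devices whose create_count is below max_create_count.  The filter below is a sketch that assumes the "name=value" output format shown above; the function name is invented, and the parameter glob in the usage comment is a guess modeled on the one in the question.

```shell
# Read "lctl get_param" output on stdin and print "device create_count max"
# for every device whose create_count is below its max_create_count.
below_max() {
  awk -F= '
    /\.max_create_count=/ { sub(/\.max_create_count$/, "", $1); max[$1] = $2; next }
    /\.create_count=/     { sub(/\.create_count$/, "", $1); cur[$1] = $2 }
    END {
      for (d in cur)
        if (d in max && cur[d] + 0 < max[d] + 0)
          print d, cur[d], max[d]
    }
  '
}

# On the MDS (glob is an assumption, adjust to your device names):
#   lctl get_param "osc.*-osc-MDT*.*create_count" | below_max
```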

* Is there a hard-coded limit to the max value?
MGS# lctl set_param osc.FS1-OST0000-osc-MDT0000.max_create_count=40000
error: set_param: setting /sys/fs/lustre/osc/FS1-OST0000-osc-MDT0000/max_create_count=40000: Numerical result out of range

The max_create_count is between 32 and 20000 (for protocol recovery reasons, since unused precreated objects are destroyed during recovery, and we put a cap on how many objects could be destroyed to avoid badness in case of a bug) so this is already at the maximum.  You should be able to increase the create_count to 20000 as well. However, this value is "auto tuned" based on how long it takes the OSS to create the requested objects.  If the OST_CREATE RPC takes too long then the MDS will ask for fewer objects next time.
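To avoid the "Numerical result out of range" error when scripting this, a requested value can be clamped to the 32..20000 range first.  A minimal sketch, with hypothetical helper names, that prints the set_param command for review instead of running it; the device name in the usage comment is the one from the question.  Since create_count is auto-tuned, the value may drift back down afterwards.

```shell
# Clamp a requested value to the 32..20000 range described above.
clamp_create_count() {
  want=$1
  [ "$want" -lt 32 ] && want=32
  [ "$want" -gt 20000 ] && want=20000
  echo "$want"
}

# Print the set_param command for one MDS-side device (run it on the MDS).
create_count_cmd() {
  dev=$1
  echo "lctl set_param osc.$dev.create_count=$(clamp_create_count "$2")"
}

# Example:
#   create_count_cmd FS1-OST0000-osc-MDT0000 40000
```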

* Is there a theoretical down side to pre-creating more objects?  (MDS or OSS memory usage?  Longer mount times? slower e2fsck?)

A bit slower e2fsck, but compared to the total filesystem size this is minor.  The biggest issue is that the old precreated objects will be destroyed during MDS-OSS recovery and new ones created.

* Are there other tunings we should be looking at to improve performance with Progressive File Layouts and our particular balance of just 2 Flash OSTs to 32 HDD OSTs?

Depends what the above results show.  There is the "obdfilter.*.precreate_batch" tunable, which can help optimize OST creates if there is a lot of lock contention on the object directories for create vs. concurrent IO, but it is unlikely to be an issue under normal usage.  If the problem is that the OSTs have huge numbers of objects and large object directories there are other potential optimizations possible.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud
