[lustre-discuss] Improving file create performance with larger create_count

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Fri Jan 8 15:27:54 PST 2021

Andreas, thanks for the insight and advice.  Follow-up details are inline.

On Thu, Jan 7, 2021 at 9:56 PM Andreas Dilger <adilger at whamcloud.com> wrote:

> On Jan 7, 2021, at 08:54, Nathan Dauchy - NOAA Affiliate wrote:
>> I am looking for assistance on how to improve file create rate, as
>> measured with MDtest.
>
> one thing to pay attention to is how large your PFL layout is.  If it is 4
> components and/or you have an MDT formatted with an older (pre 2.10)
> version of Lustre, then the amount of xattr space for the layout directly
> in the MDT inode is relatively small.  Limiting the PFL layout to 3
> components and having an MDT inode size of 1KB should avoid the extra
> overhead of writing to a separate xattr block for each create.

The PFL layout has 5 components, set up like this:
lfs setstripe -E 128K -c 1 -S 128K -p SSD -E 32M -S 1M -c 1 -p HDD \
    -E 1G -c 4 -S 64M -E 32G -c 8 -E -1 -c 30

The MDT is brand new, formatted with 2.12+.
Using "dumpe2fs -h" it appears as though the inode size is indeed 1KB:
dumpe2fs 1.45.2.cr2 (09-Apr-2020)
Inode size:          1024

Does the following debugfs command confirm or refute whether file layouts
created with that config are staying within the MDT inode?  (I'm not sure
if I can just do 32+24+632+47 = 735 < 1024, or if there is another constant
value that needs to be added in.)

MDS3# debugfs -c -R 'stat /REMOTE_PARENT_DIR/0x2c000f630:0x1:0x0/Nathan.Dauchy/empty' /dev/md66
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 33 f6 00 c0 02 00 00 00 02 00
00 00 00 00 00 00
  lma: fid=[0x2c000f633:0x2:0x0] compat=0 incompat=0
  trusted.lov (632)
  trusted.link (47)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00

If I use a 3-component layout and an empty file like this:
lfs setstripe -E 32M -S 1M -c 1 -p HDD -E 1G -c 4 -S 64M -E -1 -c 8 3comp
...then the "trusted.lov" value is the only one that decreases:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 33 f6 00 c0 02 00 00 00 07 00
00 00 00 00 00 00
  lma: fid=[0x2c000f633:0x7:0x0] compat=0 incompat=0
  trusted.link (47)
  trusted.lov (392)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00
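
As a rough way to sanity-check the sum above, the sketch below estimates
in-inode xattr usage for the two layouts.  The constants are assumptions
taken from the standard ext4 on-disk format that ldiskfs is based on (128-byte
base inode, 4-byte in-inode xattr header, 16-byte entry descriptor per xattr,
names stored without the "trusted." prefix, names and values padded to 4
bytes, 4-byte terminator); none of them come from this thread, so treat the
result as an estimate rather than a definitive answer.

```shell
#!/bin/sh
# Rough estimate of in-inode xattr usage on an ldiskfs (ext4-based) MDT.
# All constants below are ASSUMPTIONS from the generic ext4 on-disk layout.
INODE_SIZE=1024     # "Inode size" reported by dumpe2fs -h
BASE_INODE=128      # fixed part of an ext4 inode (assumed)
EXTRA_ISIZE=32      # "Size of extra inode fields" from debugfs stat
IBODY_HEADER=4      # in-inode xattr magic header (assumed)
ENTRY_HEADER=16     # per-xattr entry descriptor (assumed)
TERMINATOR=4        # zero word ending the entry list (assumed)

# Round a length up to the next multiple of 4 bytes.
pad4() { echo $(( ($1 + 3) / 4 * 4 )); }

# Usage: xattr_usage NAMELEN:VALUELEN ...
# NAMELEN is the name length without the "trusted." prefix.
xattr_usage() {
    total=$(( IBODY_HEADER + TERMINATOR ))
    for pair in "$@"; do
        name_len=${pair%%:*}
        value_len=${pair##*:}
        total=$(( total + ENTRY_HEADER + $(pad4 "$name_len") + $(pad4 "$value_len") ))
    done
    echo "$total"
}

AVAILABLE=$(( INODE_SIZE - BASE_INODE - EXTRA_ISIZE ))

# lma(24), lov(632 or 392), link(47), som(24) from the debugfs output above.
echo "5-component: $(xattr_usage 3:24 3:632 4:47 3:24) of $AVAILABLE bytes"
echo "3-component: $(xattr_usage 3:24 3:392 4:47 3:24) of $AVAILABLE bytes"
```

Under these assumptions the estimate is 816 of 864 bytes for the 5-component
layout and 576 of 864 for the 3-component one.  Any additional xattrs (ACLs,
security labels, extra hard links in trusted.link) would add to that, so the
5-component case has little headroom even if the estimate holds.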

> The other potential source of overhead is if the flash OSTs have a very
> large number of objects vs. the HDD OSTs (e.g. hundreds of millions if
> there is *always* a flash component for each file and they are never
> removed from large files during OST migration) which make the OST object
> directories very large, and each create essentially goes into a separate
> directory block.  The saving grace in your case is that the flash OSTs have
> higher IOPS, so they _should_ be able to keep up with the HDD OSTs even if
> they have a higher workload.

I don't believe this applies (to us at least) as it is a new and fairly
empty filesystem.  Thanks for the heads up for what to watch for in the
future though!

> You could try isolating whether the bottleneck is from the 2 flash OSTs to
> 32 HDD OSTs, or because of the large layout size.  Try creating 1-stripe
> files in individual directories that are each setstripe to create only on a
> single OST.  That should show what the single OST create rate limit is.
> You could compare this against the no-stripe create rate for the MDS.
> On the flip side, you could create files in a directory with a large
> layout that has many components, like:
>   lfs setstripe -E 16M -c 1 -i 1 -E 32M -i 2 -E 48M -i 3 -E eof -i 4
>  $MOUNT/ost1largedir
> to see how this affects the create rate.  The OST objects would be
> consumed 4x as fast, but from 4x OSTs (assume they are the HDD OSTs) so
> they _should_ be creating/consuming OST objects at the same rate as a
> 1-stripe file, except the large layout size will push this to a
> separate xattr block and put much more IOPS onto the MDS.  You could try
> with the two flash OSTs, two objects on each of the two OSTs, which should
> get you 1/2 the create rate if the OST create rate is the limiting factor.
> If this tanks performance then the issue is with the large layout xattr and
> not the OSTs.

Additional benchmarking is in progress already.  If we need more clues I'll
add that to the list.
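
For reference, the single-OST isolation test described above could be set up
roughly as follows.  This is a dry-run sketch that only prints the commands;
the mount point, directory names, OST indexes, and file count are
hypothetical placeholders, and the mdtest line assumes its usual -F (files
only), -C (create only), -n, and -d options without any MPI launcher.

```shell
#!/bin/sh
# Dry-run sketch: prints commands rather than executing them.
# Usage: per_ost_dir_cmds BASEDIR NFILES OST_INDEX...
# BASEDIR, NFILES and the OST indexes are hypothetical placeholders.
per_ost_dir_cmds() {
    base=$1
    nfiles=$2
    shift 2
    for idx in "$@"; do
        dir="$base/ost$idx"
        echo "mkdir -p $dir"
        # New files in this directory get 1 stripe, always on OST index $idx.
        echo "lfs setstripe -c 1 -i $idx $dir"
        # Create-only, files-only mdtest run in that directory.
        echo "mdtest -F -C -n $nfiles -d $dir"
    done
}

# Example: one directory per OST for (hypothetical) OST indexes 0 and 1.
per_ost_dir_cmds /mnt/lustre/create_test 10000 0 1
```

Comparing the per-directory create rates against a run in a directory with
the full PFL layout should indicate whether a single OST or the layout size
is the limiting factor.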

>> I have found the "create_count" and "max_create_count" tunables, but the
>> Lustre manual only references those in the context of removing or disabling
>> an OST.  So either the manual is incomplete or I'm looking at the wrong
>> tunable.
>
> The important place to check the create_count/max_create_count is on the
> MDS, since it is the one driving object creates to the OSTs.

Understood, that "MGS" was my typo replacing a hostname.  It's the "MDS".
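
A minimal sketch of checking and raising these tunables on the MDS follows.
It only prints the lctl commands.  The "osp.<fsname>-OST*" parameter path is
an assumption based on how recent Lustre 2.x releases expose these settings,
and the filesystem name and new value are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints lctl commands intended to be run on the MDS.
# Usage: create_count_cmds FSNAME [NEW_CREATE_COUNT]
# The osp.* path is assumed; FSNAME and NEW_CREATE_COUNT are hypothetical.
create_count_cmds() {
    fsname=$1
    new_count=$2
    echo "lctl get_param osp.$fsname-OST*.create_count"
    echo "lctl get_param osp.$fsname-OST*.max_create_count"
    if [ -n "$new_count" ]; then
        # Only emitted when a new value is requested.
        echo "lctl set_param osp.$fsname-OST*.create_count=$new_count"
    fi
}

create_count_cmds testfs 512
```

Values set with "lctl set_param" are not persistent, so this form is suited
to benchmarking experiments rather than permanent tuning.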

>> * Are there other tunings we should be looking at to improve performance
>> with Progressive File Layouts and our particular balance of just 2 Flash
>> OSTs to 32 HDD OSTs?
>
> Depends what the above results show.  There is the
> "obdfilter.*.precreate_batch" tunable, which can help optimize OST creates
> if there is a lot of lock contention on the object directories for create
> vs. concurrent IO, but it is unlikely to be an issue under normal usage.
> If the problem is that the OSTs have huge numbers of objects and large
> object directories there are other potential optimizations possible.

With the simple mdtest 0-byte file test, it doesn't sound like contention
would be the problem.  Nor do the OSTs have huge numbers of objects (yet).

We will keep digging.

Thanks again,
