[lustre-discuss] Improving file create performance with larger create_count

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Fri Jan 8 15:27:54 PST 2021


Andreas, thanks for the insight and advice.  Follow-up details are inline
below...

On Thu, Jan 7, 2021 at 9:56 PM Andreas Dilger <adilger at whamcloud.com> wrote:

> On Jan 7, 2021, at 08:54, Nathan Dauchy - NOAA Affiliate wrote:
>
> I am looking for assistance on how to improve file create rate, as
> measured with MDtest.
>
> One thing to pay attention to is how large your PFL layout is.  If it has 4
> components and/or you have an MDT formatted with an older (pre-2.10)
> version of Lustre, then the amount of xattr space for the layout directly
> in the MDT inode is relatively small.  Limiting the PFL layout to 3
> components and having an MDT inode size of 1KB should avoid the extra
> overhead of writing to a separate xattr block for each create.
>

The PFL layout has 5 components, setup like this:
lfs setstripe -E 128K -c 1 -S 128K -p SSD -E 32M -S 1M -c 1 -p HDD \
    -E 1G -c 4 -S 64M -E 32G -c 8 -E -1 -c 30

The MDT is brand new, formatted with 2.12+.
Using "dumpe2fs -h" it appears as though the inode size is indeed 1KB:
dumpe2fs 1.45.2.cr2 (09-Apr-2020)
Inode size:          1024
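
(For reference, that check boils down to something like the following, run
against the same MDT device used in the debugfs command below.  The mkfs line
is only an untested sketch of how I believe the inode size could be chosen
explicitly if we ever reformat; testfs, the index, the mgsnode NID, and
$MDTDEV are all placeholder values.)

dumpe2fs -h /dev/md66 | grep -i "Inode size"

mkfs.lustre --mdt --fsname=testfs --index=0 --mgsnode=10.0.0.1@o2ib \
    --mkfsoptions="-I 1024" $MDTDEV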

Does the following debugfs command confirm or refute whether file layouts
created with that config are staying within the MDT inode?  (I'm not sure
whether I can just sum the sizes shown, 32 + 24 + 632 + 47 + 24 = 759 < 1024,
or whether there is another constant value that needs to be added in.)

MDS3# debugfs -c -R 'stat /REMOTE_PARENT_DIR/0x2c000f630:0x1:0x0/Nathan.Dauchy/empty' /dev/md66
Size of extra inode fields: 32
Extended attributes:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 33 f6 00 c0 02 00 00 00 02 00 00 00 00 00 00 00
  lma: fid=[0x2c000f633:0x2:0x0] compat=0 incompat=0
  trusted.lov (632)
  trusted.link (47)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
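
(One more check I am tempted to add, if I am reading debugfs correctly: my
understanding is that a non-zero "File ACL:" block number in the full stat
output would mean the inode has spilled into an external xattr block, so
something like this should confirm it either way:

MDS3# debugfs -c -R 'stat /REMOTE_PARENT_DIR/0x2c000f630:0x1:0x0/Nathan.Dauchy/empty' /dev/md66 | grep -i 'File ACL'

...with "File ACL: 0" meaning everything still fits in the inode.)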

If I instead use a 3-component layout for an empty file, like this:
lfs setstripe -E 32M -S 1M -c 1 -p HDD -E 1G -c 4 -S 64M -E -1 -c 8 3comp
...then the "trusted.lov" value is the only one that decreases:
  trusted.lma (24) = 00 00 00 00 00 00 00 00 33 f6 00 c0 02 00 00 00 07 00 00 00 00 00 00 00
  lma: fid=[0x2c000f633:0x7:0x0] compat=0 incompat=0
  trusted.link (47)
  trusted.lov (392)
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
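
(Comparing the two listings, that works out to roughly (632 - 392) / 2 = 120
bytes per component plus a fixed base, though I may be ignoring some per-xattr
overhead.)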

> The other potential source of overhead is if the flash OSTs have a very
> large number of objects vs. the HDD OSTs (e.g. hundreds of millions if
> there is *always* a flash component for each file and they are never
> removed from large files during OST migration), which makes the OST object
> directories very large, and each create essentially goes into a separate
> directory block.  The saving grace in your case is that the flash OSTs have
> higher IOPS, so they _should_ be able to keep up with the HDD OSTs even if
> they have a higher workload.
>

I don't believe this applies (to us at least) as it is a new and fairly
empty filesystem.  Thanks for the heads-up on what to watch for in the
future, though!
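
(For future reference, I assume the per-OST object counts can be watched from
a client with something like the following, where /mnt/testfs stands in for
our real mount point; the IUsed column should show roughly how many objects
each OST is carrying.)

lfs df -i /mnt/testfs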


> You could try isolating whether the bottleneck comes from the ratio of 2
> flash OSTs to 32 HDD OSTs, or from the large layout size.  Try creating
> 1-stripe files in individual directories that each have a setstripe to
> create only on a single OST.  That should show what the single-OST create
> rate limit is.  You could compare this against the no-stripe create rate
> for the MDS.
>
> On the flip side, you could create files in a directory with a large
> layout that has many components, like:
>
>   lfs setstripe -E 16M -c 1 -i 1 -E 32M -i 2 -E 48M -i 3 -E eof -i 4 \
>       $MOUNT/ost1largedir
>
> to see how this affects the create rate.  The OST objects would be
> consumed 4x as fast, but from 4x OSTs (assuming they are the HDD OSTs), so
> they _should_ be creating/consuming OST objects at the same rate as a
> 1-stripe file, except the large layout size will push this to a separate
> xattr block and put much more IOPS onto the MDS.  You could also try this
> with the two flash OSTs, putting two objects on each of the two OSTs, which
> should give you 1/2 the create rate if the OST create rate is the limiting
> factor.  If this tanks performance, then the issue is with the large layout
> xattr and not the OSTs.
>

Additional benchmarking is already in progress.  If we need more clues, I'll
add that to the list.
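
(For the archives, the single-OST isolation run I have in mind looks roughly
like the following; the mount point, OST index, MPI task count, and mdtest
parameters are just example values, to be repeated per OST and then compared
against the same run in a directory using the full PFL layout.)

mkdir /mnt/testfs/ost0only
lfs setstripe -c 1 -i 0 /mnt/testfs/ost0only
mpirun -np 16 mdtest -F -C -n 10000 -d /mnt/testfs/ost0only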

> I have found the "create_count" and "max_create_count" tunables, but the
> Lustre manual only references those in the context of removing or disabling
> an OST.  So either the manual is incomplete or I'm looking at the wrong
> tunable.
>
> The important place to check create_count/max_create_count is on the MDS,
> since it is the one driving object creates to the OSTs.
>

Understood, that "MGS" was my typo replacing a hostname.  It's the "MDS".
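
(I believe those tunables show up on the MDS as OSP parameters, so checking
and adjusting them would look something like the following; the glob assumes
the default device naming, and 2048 is only an example value.)

MDS3# lctl get_param osp.*.create_count osp.*.max_create_count
MDS3# lctl set_param osp.*.create_count=2048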

> * Are there other tunings we should be looking at to improve performance
> with Progressive File Layouts and our particular balance of just 2 Flash
> OSTs to 32 HDD OSTs?
>
>
> It depends on what the above results show.  There is the
> "obdfilter.*.precreate_batch" tunable, which can help optimize OST creates
> if there is a lot of lock contention on the object directories for creates
> vs. concurrent IO, but it is unlikely to be an issue under normal usage.
> If the problem is that the OSTs have huge numbers of objects and large
> object directories, there are other potential optimizations possible.
>

With the simple mdtest 0-byte file test, it doesn't sound like contention
would be the problem.  Nor is it that the OSTs have huge numbers of
objects (yet).
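
(If contention ever does become an issue, I assume the tunable Andreas
mentioned can be inspected and adjusted on the OSS with something like the
following, where 256 is only an example value.)

lctl get_param obdfilter.*.precreate_batch
lctl set_param obdfilter.*.precreate_batch=256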

We will keep digging.

Thanks again,
Nathan