[lustre-discuss] DNE mkdir performance

Rungta, Vandana vrungta at amazon.com
Mon Jul 14 18:46:04 PDT 2025


Andreas and WangDi,

Your comments and suggestions have been very helpful in getting context around this area.  Our crash testing validates that we could recover if one MDT had committed the data. Looking forward to your comments Lai and Alex.

Thanks,
Vandana

From: Di Wang <ddiwang at google.com>
Date: Monday, July 14, 2025 at 8:29 AM
To: Andreas Dilger <adilger at ddn.com>
Cc: "Rungta, Vandana" <vrungta at amazon.com>, "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: RE: [EXTERNAL] [lustre-discuss] DNE mkdir performance



Hello,

IIRC, the reordering actually forces the update log RPC to be serialized (see LU-7426).
Hmm, this osd_sync was actually introduced by COS for DNE (LU-3538), to ensure synchronization recovery, before cross-MDT recovery was supported.
Though as Andreas mentioned, there should be enough information saved in the DNE recovery log on all MDTs now, we may not need this sync any more.  Lai, Alex, could you please comment? Thanks.

Thanks
WangDi

On Sun, Jul 13, 2025 at 12:15 AM Andreas Dilger <adilger at ddn.com> wrote:
Wang Di wrote this code originally and might be able to comment on it better, but my recollection is that the sync is needed to handle the case where there are multiple changes to the DNE recovery llog, due to bitmap updates in the llog header possibly being reordered.

Otherwise, there should be enough information saved on at least one of the MDTs to recover the whole operation if *any* of them have saved part of it.

There is some work being done by Alex Z. to improve llog performance in patch https://review.whamcloud.com/c/fs/lustre-release/+/57456 ("LU-18562 osd: improve llog writes") and https://review.whamcloud.com/c/fs/lustre-release/+/57261 ("57261: LU-7426 obdclass: indexed llog"), the latter of which should avoid the requirement for sync of the llog bitmap.

Until that point, I don't think it is totally safe to remove the sync for distributed transactions, though a 5.5x speedup is definitely attractive. The best I could suggest is to add a tunable to disable this sync, like "mdd.*.enable_dne_remote_sync" that defaults to 1, so that it keeps the current behavior, but can be disabled for testing or if performance is more critical than preserving new directories across a full system outage.
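
A minimal userspace sketch of how such a tunable could gate the sync decision. Only the tunable name "enable_dne_remote_sync" and its default of 1 come from the suggestion above; the struct, the function and its parameters are hypothetical and do not exist in Lustre.

```c
#include <stdbool.h>

/* Hypothetical model of the suggested tunable; not existing Lustre code. */
struct dne_sync_tunables {
    bool enable_dne_remote_sync;    /* true (1) keeps the current behavior */
};

static const struct dne_sync_tunables DNE_SYNC_DEFAULTS = {
    .enable_dne_remote_sync = true,
};

/*
 * Decide whether a lock cancel should force a device sync.
 *   is_remote_mdt_lock: the lock was taken on behalf of another MDT
 *   would_sync_today:   the existing sync-on-cancel conditions all hold
 */
static bool dne_remote_sync_needed(const struct dne_sync_tunables *t,
                                   bool is_remote_mdt_lock,
                                   bool would_sync_today)
{
    if (!would_sync_today)
        return false;
    /* Only the cross-MDT case is affected by the tunable. */
    if (is_remote_mdt_lock && !t->enable_dne_remote_sync)
        return false;
    return true;
}
```

With the default value the decision is unchanged; clearing the flag skips the sync only for locks held by a remote MDT.
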

Please file a ticket in Jira to track this change.

Cheers, Andreas


On Jul 11, 2025, at 15:06, Rungta, Vandana via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Summary:
When creating a directory (mkdir), Lustre does not "sync" by default when there is a single MDT. With multiple MDTs, where the child directory is created on a different MDT than the parent (cross-MDT mkdir), Lustre does an osd_sync, which we suspect is for atomicity. Our experiments show that if we disable the osd_sync in the cross-MDT case, we don't lose atomicity and the system recovers if any one of the 3 hosts involved is available (similar to the single-MDT case). So we are wondering whether this osd_sync is needed in the cross-MDT case, as the call to sync degrades performance.

Issue:
In a Lustre Distributed Namespace Environment (DNE) featuring multiple Metadata Targets (MDTs), the process of creating remote directories is notably slower compared to a single MDT file system utilizing the osd-zfs backend.

This performance issue can be consistently replicated using a single client, specifically by creating approximately 1000 child directories with the command lfs mkdir -i 1 . The parent directory is part of MDT-0, while the child directories are created on MDT-1, following a pattern such as /parent/child-0, /parent/child-1, etc.

  *   Creating 1000 child directories on Parent MDT (MDT0) takes ~0.9 sec and
  *   Creating 1000 child directories on remote MDT  (parent directory on MDT0, and child directory on MDT1) takes ~12 sec
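
The single-client timing above can be approximated with a small C helper like the one below. This is a sketch under assumptions: it calls plain mkdir(2), whereas the original test used lfs mkdir -i 1, so it only exercises the remote path if the parent directory is already configured so that new subdirectories land on the other MDT. The function name time_mkdirs is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/*
 * Create `count` directories named <parent>/child-<i> and return the
 * elapsed wall-clock time in seconds, or -1.0 on the first failure.
 * Directories that already exist are not treated as failures.
 */
static double time_mkdirs(const char *parent, int count)
{
    struct timespec start, end;
    char path[4096];
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++) {
        snprintf(path, sizeof(path), "%s/child-%d", parent, i);
        if (mkdir(path, 0755) != 0 && errno != EEXIST)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_mkdirs("/parent", 1000) once against a local-MDT parent and once against a remote-MDT parent gives the two figures to compare.
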

We also tested using mdtest with mpirun, involving two clients and 50 iterations. Directories are generated in a round-robin fashion to utilize both MDTs, as demonstrated by the command "mpirun -mca routed direct -map-by node -np 16 mdtest -n 625 -i 50 -u -d /lfs/mdtest".


Directory Operations/Sec

Operation          | With Single MDT | With 2 MDTs | Performance degradation percentage
-------------------+-----------------+-------------+-----------------------------------
Directory creation | 17260.653       | 856.898     | 95.04

Probable Cause:
The creation of a child directory on the same MDT as the parent does not force an osd_sync.
The creation of a child directory on a different MDT than the parent triggers an osd_sync of the parent directory.

The directory creation process first checks and cancels the parent directory lock that was previously acquired during a different operation. If that lock was taken as part of a previous remote directory creation, it was taken in protected write (PW) mode, which necessitates a flush of the underlying directory. However, the cancellation forces a synchronization of the entire underlying parent Metadata Target (MDT) device.

The conditions for enforcing the synchronization path are as follows:

  *   LDLM_CB_CANCELING and BLOCKING_SYNC_ON_CANCEL
  *   l_granted_mode is one of (LCK_EX | LCK_PW | LCK_GROUP)
  *   OBD_CONNECT_MDS_MDS bit set in l_export
Corresponding code links

  *    Link to check the above conditions at  https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1336-L1342
  *   The path that invokes the synchronization is at https://github.com/lustre/lustre-release/blob/master/lustre/target/tgt_handler.c#L1381-L1394,  provided that the locks are not taken with the LDLM_STRIPE option.
  *   The sync of the entire device is enforced at https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1288

Experiment:
I did an experiment where I skipped the osd_sync on directory create, and saw the following results:

Using a single client, specifically by creating approximately 1000 child directories with the command lfs mkdir -i 1 .

  *   Creating 1000 child directories on Parent MDT (MDT0) takes ~1.6 sec and
  *   Creating 1000 child directories on remote MDT  (Child directory on MDT1) takes ~3.8 sec

Same test using mdtest with mpirun results:


Directory Operations/Sec on DNE filesystem

Operation          | Default | Without osd_sync | Performance improvement percentage
-------------------+---------+------------------+-----------------------------------
Directory creation | 856.898 | 5659.511         | 560.46
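
For reference, the two percentage columns in the mdtest tables can be reproduced from the ops/sec figures as below. The helper names are illustrative, and the formulas are the usual relative-change definitions, which I am assuming are the ones used for the tables.

```c
/* Percentage drop in throughput going from `before` to `after`. */
static double pct_degradation(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Percentage gain in throughput going from `before` to `after`. */
static double pct_improvement(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

pct_degradation(17260.653, 856.898) comes to about 95.04 and pct_improvement(856.898, 5659.511) to about 560.46, matching the tables.
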

We conducted crash testing with osd_sync disabled, specifically targeting remote directory creation, and observed the following outcomes:

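Crash: Client | Crash: MDT0 | Crash: MDT1 | Filesystem State
--------------+-------------+-------------+-------------------------------------------------------
Yes           | No          | No          | Recovered, healthy and could verify the directory tree
No            | Yes         | No          | Recovered, healthy and could verify the directory tree
No            | No          | Yes         | Recovered, healthy and could verify the directory tree
Yes           | Yes         | No          | Recovered, healthy and could verify the directory tree
Yes           | No          | Yes         | Recovered, healthy and could verify the directory tree
No            | Yes         | Yes         | Recovered, healthy and could verify the directory tree
Yes           | Yes         | Yes         | Lost some directory entries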

We are trying to understand the implications of disabling osd_sync. The POSIX spec for mkdir (https://pubs.opengroup.org/onlinepubs/9699919799/functions/mkdir.html) does not explicitly require it to be synchronously durable, and pushes the burden to the user to call fsync.
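
As an illustration of what "the user calls fsync" looks like for a directory create, here is a sketch using only POSIX calls: create the child, then fsync the parent directory's file descriptor. The function name mkdir_durable is made up for this example. Whether an fsync on the parent also covers the remote MDT's part of a cross-MDT mkdir is an open assumption here, not something we have verified.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create <parent>/<name>, then fsync the parent directory so the new
 * entry is flushed before returning.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mkdir_durable(const char *parent, const char *name, mode_t mode)
{
    int dirfd, rc, saved_errno;

    dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    rc = mkdirat(dirfd, name, mode);
    if (rc == 0)
        rc = fsync(dirfd);

    /* close() may overwrite errno, so preserve the first error. */
    saved_errno = errno;
    close(dirfd);
    errno = saved_errno;

    return rc;
}
```

An application that needs the directory to survive a crash would call mkdir_durable("/parent", "child-0", 0755) instead of a bare mkdir.
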

We do though need mkdir to be atomic and not leave partial directory artifacts on one mdt and not another. This is the part where we would like to understand from you if we are breaking the concurrency behavior here.

Proposed Change

diff --git a/lustre/target/tgt_handler.c b/lustre/target/tgt_handler.c
index 33b9863bdc..80948b5f7a 100644
--- a/lustre/target/tgt_handler.c
+++ b/lustre/target/tgt_handler.c
@@ -1333,12 +1333,17 @@ static int tgt_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
                RETURN(-EINVAL);
        }

+       //
+       // Proposed Change:
+       // Skip the tgt_sync if the corresponding operation is across OSDS and inode is being updated under IBITS lock
+       //
        if (flag == LDLM_CB_CANCELING &&
            (lock->l_granted_mode & (LCK_EX | LCK_PW | LCK_GROUP)) &&
            (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_ALWAYS ||
             (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_BLOCKING &&
              ldlm_is_cbpending(lock))) &&
-           ((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) ||
+           (((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) &&
+              lock->l_resource->lr_type != LDLM_IBITS) ||
             lock->l_resource->lr_type == LDLM_EXTENT)) {
                __u64 start = 0;
                __u64 end = OBD_OBJECT_EOF;
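
To make the effect of the changed condition easier to see, here is a standalone userspace model of the predicate before and after the patch. It is a sketch, not kernel code: the enums and the struct are stand-ins invented for this example (their numeric values are arbitrary), and only the boolean structure is taken from the diff above.

```c
#include <stdbool.h>

/* Stand-ins for LCK_*, LDLM_* and SYNC_LOCK_CANCEL_*; values are arbitrary. */
enum model_mode { MODE_EX = 1, MODE_PW = 2, MODE_PR = 4, MODE_GROUP = 8 };
enum model_res  { RES_EXTENT, RES_IBITS, RES_OTHER };
enum model_sync { SYNC_NEVER, SYNC_BLOCKING, SYNC_ALWAYS };

struct cancel_ctx {
    bool canceling;             /* flag == LDLM_CB_CANCELING */
    unsigned int granted_mode;  /* lock->l_granted_mode */
    enum model_sync policy;     /* tgt->lut_sync_lock_cancel */
    bool cbpending;             /* ldlm_is_cbpending(lock) */
    bool mds_mds_export;        /* OBD_CONNECT_MDS_MDS set on the export */
    enum model_res res_type;    /* lock->l_resource->lr_type */
};

/* Conditions shared by the old and new versions of the check. */
static bool common_conditions(const struct cancel_ctx *c)
{
    return c->canceling &&
           (c->granted_mode & (MODE_EX | MODE_PW | MODE_GROUP)) != 0 &&
           (c->policy == SYNC_ALWAYS ||
            (c->policy == SYNC_BLOCKING && c->cbpending));
}

/* Original condition: any MDS-MDS lock, or any extent lock, syncs. */
static bool sync_before_patch(const struct cancel_ctx *c)
{
    return common_conditions(c) &&
           (c->mds_mds_export || c->res_type == RES_EXTENT);
}

/* Proposed condition: MDS-MDS locks on IBITS resources no longer sync. */
static bool sync_after_patch(const struct cancel_ctx *c)
{
    return common_conditions(c) &&
           ((c->mds_mds_export && c->res_type != RES_IBITS) ||
            c->res_type == RES_EXTENT);
}
```

The two functions differ only for a canceled EX/PW/GROUP lock held over an MDS-MDS export on an IBITS resource, which is the remote mkdir case: the original check syncs and the proposed one does not.
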
