[lustre-discuss] DNE mkdir performance

Rungta, Vandana vrungta at amazon.com
Fri Jul 11 13:58:50 PDT 2025


Summary:
When creating a directory (mkdir), Lustre does not “sync” by default when there is a single MDT. With multiple MDTs, when the child directory is created on a different MDT than the parent (cross-MDT mkdir), Lustre does an osd_sync, which we suspect is for atomicity. Our experiments show that if we disable the osd_sync in the cross-MDT case, we don’t lose atomicity, and the system recovers as long as at least one of the 3 hosts involved remains available (similar to the single-MDT case). So we are wondering whether this osd_sync is needed in the cross-MDT case, since the call to sync degrades performance.

Issue:
In a Lustre Distributed Namespace Environment (DNE) featuring multiple Metadata Targets (MDTs), the process of creating remote directories is notably slower compared to a single MDT file system utilizing the osd-zfs backend.

This performance issue can be consistently replicated using a single client, specifically by creating approximately 1000 child directories with the command lfs mkdir -i 1 . The parent directory is part of MDT-0, while the child directories are created on MDT-1, following a pattern such as /parent/child-0, /parent/child-1, etc.

  *   Creating 1000 child directories on the parent MDT (MDT0) takes ~0.9 sec.
  *   Creating 1000 child directories on a remote MDT (parent directory on MDT0, child directories on MDT1) takes ~12 sec.

We also tested using mdtest with mpirun, with two clients and 50 iterations. Directories are created in a round-robin fashion to utilize both MDTs, using the command "mpirun -mca routed direct -map-by node -np 16 mdtest -n 625 -i 50 -u -d /lfs/mdtest".


Operation            Directory Operations/Sec            Performance degradation
                     With Single MDT    With 2 MDTs      percentage
-------------------  -----------------  ---------------  -----------------------
Directory creation   17260.653          856.898          95.04

Probable Cause:
The creation of a child directory on the same MDT as the parent does not force an osd_sync.
The creation of a child directory on a different MDT than the parent triggers an osd_sync of the parent directory.

The directory creation process first checks for, and cancels, the parent directory lock acquired during an earlier operation. If that lock was taken as part of a previous remote directory creation, it was taken in protected write (PW) mode, so cancelling it requires flushing the underlying directory. The cancellation path, however, forces a sync of the whole underlying parent MDT device rather than just that directory.

The conditions for enforcing the synchronization path are as follows:

  *   LDLM_CB_CANCELING and BLOCKING_SYNC_ON_CANCEL
  *   l_granted_mode is one of (LCK_EX | LCK_PW | LCK_GROUP)
  *   OBD_CONNECT_MDS_MDS bit set in l_export

Corresponding code links:

  *   The check for the above conditions is at https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1336-L1342
  *   The path that invokes the synchronization is at https://github.com/lustre/lustre-release/blob/master/lustre/target/tgt_handler.c#L1381-L1394, provided that the locks are not taken with the LDLM_STRIPE option.
  *   The sync of the entire device is enforced at https://github.com/lustre/lustre-release/blob/b2_15/lustre/target/tgt_handler.c#L1288
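
To make the conditions above easier to reason about, here is a rough, standalone C model of the check. It is only a sketch based on the condition as it appears in the diff later in this mail, not the Lustre code itself: the struct, the enum names and their numeric values are hypothetical stand-ins for l_granted_mode, ldlm_is_cbpending(), the OBD_CONNECT_MDS_MDS export flag and lr_type.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the Lustre definitions; values are illustrative. */
enum model_mode { M_EX = 1, M_PW = 2, M_PR = 4, M_GROUP = 64 };
enum model_res_type { RES_PLAIN, RES_EXTENT, RES_FLOCK, RES_IBITS };
enum model_sync_policy { SYNC_NEVER, SYNC_BLOCKING, SYNC_ALWAYS };

struct model_lock {
	unsigned int granted_mode;     /* models lock->l_granted_mode */
	bool cbpending;                /* models ldlm_is_cbpending(lock) */
	bool from_mds_mds_export;      /* models OBD_CONNECT_MDS_MDS on l_export */
	enum model_res_type res_type;  /* models lock->l_resource->lr_type */
};

/* Returns true when cancelling this lock would take the device sync path. */
bool sync_on_cancel(const struct model_lock *lk, bool canceling,
		    enum model_sync_policy policy)
{
	/* Only write-type modes (EX, PW, GROUP) qualify. */
	bool write_mode = (lk->granted_mode & (M_EX | M_PW | M_GROUP)) != 0;
	/* Sync always, or only for blocking cancels with a pending callback. */
	bool policy_ok = policy == SYNC_ALWAYS ||
			 (policy == SYNC_BLOCKING && lk->cbpending);
	/* Lock held by another MDT, or an extent lock. */
	bool origin_ok = lk->from_mds_mds_export ||
			 lk->res_type == RES_EXTENT;

	return canceling && write_mode && policy_ok && origin_ok;
}
```

In this model, a PW lock held through an MDT-to-MDT export takes the sync path on a blocking cancel, which matches the remote mkdir case described above, while the same lock held through an ordinary client export does not.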

Experiment:
I did an experiment where I skipped the osd_sync on directory create, and saw the following results:

Using a single client and the same command (lfs mkdir -i 1 .) to create approximately 1000 child directories:

  *   Creating 1000 child directories on the parent MDT (MDT0) takes ~1.6 sec.
  *   Creating 1000 child directories on a remote MDT (child directories on MDT1) takes ~3.8 sec.

Results of the same mdtest test with mpirun:


Operation            Directory Operations/Sec on DNE filesystem   Performance improvement
                     Default            Without osd_sync          percentage
-------------------  -----------------  ------------------------  -----------------------
Directory creation   856.898            5659.511                  560.46
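
For reference, the percentage columns in the two mdtest tables appear to be computed relative to the first rate column of each table. A small C sketch of that arithmetic (the function names are made up for illustration):

```c
/* Percentage drop from a baseline rate to a slower observed rate. */
double pct_degradation(double baseline, double observed)
{
	return (baseline - observed) / baseline * 100.0;
}

/* Percentage gain from a baseline rate to a faster observed rate. */
double pct_improvement(double baseline, double observed)
{
	return (observed - baseline) / baseline * 100.0;
}
```

pct_degradation(17260.653, 856.898) gives approximately 95.04, and pct_improvement(856.898, 5659.511) gives approximately 560.46, matching the tables.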

We conducted crash testing with osd_sync disabled, specifically targeting remote directory creation, and observed the following outcomes:

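Crash                      Filesystem State
Client   MDT0    MDT1
-------  ------  ------    ------------------------------------------------------
Yes      No      No        Recovered, healthy and could verify the directory tree
No       Yes     No        Recovered, healthy and could verify the directory tree
No       No      Yes       Recovered, healthy and could verify the directory tree
Yes      Yes     No        Recovered, healthy and could verify the directory tree
Yes      No      Yes       Recovered, healthy and could verify the directory tree
No       Yes     Yes       Recovered, healthy and could verify the directory tree
Yes      Yes     Yes       Lost some directory entries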

We are trying to understand the implications of disabling osd_sync. The POSIX spec for mkdir (https://pubs.opengroup.org/onlinepubs/9699919799/functions/mkdir.html) does not explicitly require it to be synchronously durable, and pushes the burden to the user to call fsync.

We do, though, need mkdir to be atomic and not leave partial directory artifacts on one MDT and not the other. This is where we would like your help: are we breaking the concurrency behavior by skipping the sync here?

Proposed Change:

diff --git a/lustre/target/tgt_handler.c b/lustre/target/tgt_handler.c
index 33b9863bdc..80948b5f7a 100644
--- a/lustre/target/tgt_handler.c
+++ b/lustre/target/tgt_handler.c
@@ -1333,12 +1333,17 @@ static int tgt_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
                RETURN(-EINVAL);
        }

+       //
+       // Proposed Change:
+       // Skip the tgt_sync if the corresponding operation is across OSDS and inode is being updated under IBITS lock
+       //
        if (flag == LDLM_CB_CANCELING &&
            (lock->l_granted_mode & (LCK_EX | LCK_PW | LCK_GROUP)) &&
            (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_ALWAYS ||
             (tgt->lut_sync_lock_cancel == SYNC_LOCK_CANCEL_BLOCKING &&
              ldlm_is_cbpending(lock))) &&
-           ((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) ||
+           (((exp_connect_flags(lock->l_export) & OBD_CONNECT_MDS_MDS) &&
+              lock->l_resource->lr_type != LDLM_IBITS) ||
             lock->l_resource->lr_type == LDLM_EXTENT)) {
                __u64 start = 0;
                __u64 end = OBD_OBJECT_EOF;
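
To spell out what the patch changes, here is a small standalone C sketch of just the last clause of the condition, before and after. The enum and function names are hypothetical stand-ins (mds_mds models the OBD_CONNECT_MDS_MDS export flag, the enum models lr_type); this is a model for discussion, not the Lustre code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the LDLM resource types. */
enum model_res_type { RES_PLAIN, RES_EXTENT, RES_FLOCK, RES_IBITS };

/* Last clause of the condition as it is today. */
bool origin_clause_before(bool mds_mds, enum model_res_type t)
{
	return mds_mds || t == RES_EXTENT;
}

/* Last clause with the proposed change applied. */
bool origin_clause_after(bool mds_mds, enum model_res_type t)
{
	return (mds_mds && t != RES_IBITS) || t == RES_EXTENT;
}
```

In this model the two versions differ for exactly one input: an IBITS lock held through an MDT-to-MDT export, which no longer takes the sync path. Extent locks, and non-IBITS locks from another MDT, behave as before.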
