[lustre-discuss] DNE v3 and directory inode changing

Andreas Dilger adilger at whamcloud.com
Thu Mar 23 19:39:33 PDT 2023


The DNE auto-split functionality is disabled by default and was never fully completed (e.g. it does not preserve inode numbers), because it caused significant performance impact/latency while splitting a directory that was actively in use (which is exactly when you would want to split it), so I wouldn't recommend using it at this time.
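Auto-split can be checked (and explicitly kept off) per MDT on the servers. A minimal sketch, reusing the "mylustre" filesystem name from the transcript below:

mds0 # lctl get_param mdt.mylustre-MDT0000.enable_dir_auto_split
mdt.mylustre-MDT0000.enable_dir_auto_split=0
mds0 # lctl set_param mdt.*.enable_dir_auto_split=0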

Instead, development efforts were focussed on DNE MDT space balancing.  This adds two different features that allow all of the MDTs in a filesystem to be used without user/admin intervention (though it is still possible to manually create directories on specific MDTs as before).

The "round-robin" MDT selection ("lfs setdirstripe -D --max-depth-rr=N -c 1 -i -1") for top-level directories (enabled for the top 3 levels of the filesystem by default) will, as the name suggests, round robin new directories across all of the available MDTs, when their space is evenly balanced (within 5% free space*inodes by default).  That is important to distribute *new* directories across MDTs in new filesystems when e.g. .../home/$user or .../project/$project or .../scratch/$user are being created.

The "space balance" MDT selection ("lctl set_param lmv.*.qos_threshold_rr=N" on the *CLIENT*) kicks in when MDT space usage becomes imbalanced (free space*inodes difference above 5% by default), and then starts selecting the MDT for *new* directories based on the ratio of free space*inodes.  That allows the MDTs to return toward balance over time, without causing a performance imbalance when it isn't necessary.

Note that both of these heuristics operate on *single-stripe directories*, not regular files, so the MDT balance will not be perfect if one directory tree has millions more files/subdirectories than another.  However, the main issue being avoided is the *very* common case of MDT0000 getting full while MDT0001..N are (almost) totally unused.  As a result, these features also keep the MDT *usage* balance pretty good, so it is a win-win.  For most filesystems the MDT capacity is not the limiting factor anyway (it only makes up a few percent of the total storage).
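To watch whether the MDTs are actually staying balanced, the per-target free space and free inodes can be checked from any client (mount point assumed from the transcript below):

client $ lfs df /mnt/mylustre
client $ lfs df -i /mnt/mylustre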

Cheers, Andreas

On Mar 23, 2023, at 15:31, Bertschinger, Thomas Andrew Hjorth via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

Hello,

We've been experimenting with DNEv3 recently and have run into this issue: https://jira.whamcloud.com/browse/LU-7607, where the directory inode number changes after an auto-split.

In addition to the problem noted there with backups that track inode numbers, we have found that file access through a previously-opened file descriptor is broken post-migration. This can occur when a shell's CWD is the affected directory. For example:

mds0 # lctl get_param mdt.mylustre-MDT0000.{dir_split_count,enable_dir_auto_split}
mdt.mylustre-MDT0000.dir_split_count=100
mdt.mylustre-MDT0000.enable_dir_auto_split=1

client $ pwd
/mnt/mylustre/dnetest
client $ for i in {0..100}; do touch file$i; done
client $ ls
ls: cannot open directory '.': Operation not permitted
client $ ls file0
ls: cannot access 'file0': No such file or directory
client $ ls /mnt/mylustre/dnetest/file0
/mnt/mylustre/dnetest/file0

(This is from a build of the current master branch.)
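For reference, the inode number change itself can be observed directly; a hypothetical sketch along the lines of the transcript above, assuming a fresh directory and using ls -id before and after the split is triggered:

client $ ls -id /mnt/mylustre/dnetest
client $ for i in {0..100}; do touch /mnt/mylustre/dnetest/file$i; done
client $ ls -id /mnt/mylustre/dnetest

With auto-split enabled and dir_split_count=100, the two ls -id commands report different inode numbers for the same directory.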

We believe users will certainly encounter this, because they commonly monitor the output directories of jobs as the jobs run. This issue is therefore a dealbreaker for DNEv3 for us.

I wanted to ask about the status of the linked issue, since it looks like it hasn't been updated in a while. Would the resolution of LU-7607 be expected to fix the file access problem I've noted here, or will that require additional changes to resolve?

Thanks!

- Thomas Bertschinger
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Andreas Dilger
Lustre Principal Architect
Whamcloud