[lustre-devel] modern precreate

Andreas Dilger adilger at whamcloud.com
Sat Jan 9 07:56:35 PST 2021

On Jan 8, 2021, at 12:44, Nathan Rutman <nrutman at gmail.com> wrote:

Riffing on something Andreas said in a lustre-discuss thread, I'm hoping someone can correct my understanding of how precreate works currently.

Olden days:

The MDS would ask each OST for a set of precreated objects via an MDT->OST RPC. Precreated-but-unused objects have to be cleaned up during recovery, hence a cap on the batch size. These objects were consumed as the MDS assigned them to layouts, so the MDS had to go back and get more, even for 0-length files.
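As a toy model of that flow (names like OST, MDS, and PRECREATE_BATCH are illustrative, not Lustre code), the MDS drains a local pool of precreated object IDs and refills it with a batch RPC when empty:

```python
PRECREATE_BATCH = 32   # hypothetical cap: objects that may need cleanup in recovery

class OST:
    def __init__(self):
        self.next_id = 0

    def precreate(self, count):
        """Handle an OST_CREATE-style request: create `count` objects up front."""
        batch = list(range(self.next_id, self.next_id + count))
        self.next_id += count
        return batch

class MDS:
    def __init__(self, ost):
        self.ost = ost
        self.pool = []

    def assign_object(self):
        """Consume one precreated object for a new file's layout, refilling
        from the OST when the pool runs dry -- note that even a 0-length
        file burns one object here."""
        if not self.pool:
            self.pool = self.ost.precreate(PRECREATE_BATCH)
        return self.pool.pop(0)

mds = MDS(OST())
first = [mds.assign_object() for _ in range(3)]   # three creates consume three objects
```

The point of the sketch is only the control flow: object creation happens ahead of time on the OST, and the MDS-side pool is what forces the periodic refill RPC.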

Modern days, Lustre 2.5+:

The MDT doesn't hold a pool of OST objects; instead it takes an OST FID range from the FLD server. Each MD object is mapped to an eventual OST object by this FID. The OST side just holds a small number of anonymous objects and assigns a FID to an object when an operation arrives without an existing FID->inode mapping on the OST. There is no precreate RPC necessary anymore, since OSTs maintain their own pool of anonymous objects, only consume them when data is actually written, and can create more when running low. There is no recovery cleanup needed on the OSTs.
In this case, there should be no performance difference between create and mknod except for the FLD operation, and the number of OSTs should not matter for create rates.

Is my understanding wrong? It clearly must be, since Andreas is still talking about the OST_CREATE RPC and its recovery implications, and we do see a performance difference between mknod and creating files with layouts.

The precreate code still works the same as in "the land before the time of FIDs". Actual objects are still precreated/destroyed on the OSTs. The only difference is that the FID sequences allocated to the MDTs give the OSTs a separate pool of objects for each MDT, so the MDTs don't contend or conflict when assigning those objects to their own inodes.

Having multiple MDTs does "scale" the OST object space, in that there can be more object subdirectories (one per sequence), which improves both the concurrency and the maximum number of objects.  There has also been work done to increase the maximum number of files per directory in ldiskfs, but that doesn't really improve performance.
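The per-MDT separation above can be pictured as one OST keeping disjoint per-sequence pools, with each sequence also backed by its own object subdirectory on disk. This is an illustrative model only (the class name, field names, and sequence numbers are made up, not Lustre identifiers):

```python
from collections import defaultdict

class OstObjectSpace:
    """Toy model of one OST's object namespace, keyed by FID sequence."""

    def __init__(self):
        self.next_oid = defaultdict(int)    # per-sequence object ID counter
        self.subdirs = defaultdict(list)    # sequence -> objects (one subdirectory each)

    def precreate(self, seq, count):
        """Precreate `count` objects in the pool/subdirectory for `seq`."""
        start = self.next_oid[seq]
        batch = [(seq, oid) for oid in range(start, start + count)]
        self.next_oid[seq] += count
        self.subdirs[seq].extend(batch)
        return batch

ost = OstObjectSpace()
mdt0 = ost.precreate(seq=0x1000, count=4)   # hypothetical sequence granted to MDT0000
mdt1 = ost.precreate(seq=0x1001, count=4)   # hypothetical sequence granted to MDT0001
```

Because each MDT allocates only from its own sequence, the two batches can never hand out the same (sequence, object ID) pair, and each sequence's subdirectory grows independently, which is where the concurrency and capacity scaling comes from.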

The patch https://review.whamcloud.com/38424 "LU-11912 (https://jira.whamcloud.com/browse/LU-11912) ofd: reduce LUSTRE_DATA_SEQ_MAX_WIDTH" would create smaller object directory trees, and allow "aging" of old objects into separate object directory trees from new objects. That lets old objects drop out of cache (avoiding one create per leaf block as the directory grows very large), and keeps fewer "hot" objects densely packed in memory (allowing many new entries to be packed into a single leaf block).

Cheers, Andreas
Andreas Dilger
Principal Lustre Architect

