[lustre-discuss] changing inode size on MDT

Andreas Dilger adilger at whamcloud.com
Thu Oct 3 18:38:18 PDT 2019


On Oct 3, 2019, at 05:03, Hebenstreit, Michael <michael.hebenstreit at intel.com> wrote:

So you are saying that on a ZFS-based Lustre filesystem there is no way to increase the number of available inodes? I have an 8TB MDT with roughly 17G inodes

[root at elfsa1m1 ~]# df -h
Filesystem       Size  Used Avail Use% Mounted on
mdt0000          8.3T  256K  8.3T   1% /mdt0000

[root at elfsa1m1 ~]# df -i
Filesystem           Inodes  IUsed       IFree IUse% Mounted on
mdt0000         17678817874      6 17678817868    1% /mdt0000

For ZFS, the only way to increase the number of inodes on the *MDT* is to increase the size of the MDT, though more on that below.  Note that the "number of inodes" reported by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. bytes_per_inode_ratio = bytes_used / inodes_used, total inode estimate = bytes_free / bytes_per_inode_ratio + inodes_used), which becomes more accurate as the MDT becomes more full.  With 17B inodes estimated on an 8TB MDT, that is a bytes-per-inode ratio of 497, which is unrealistically low for Lustre since the MDT always stores multiple xattrs on each inode.  Note that the filesystem only has 6 inodes allocated, so the ZFS total-inode estimate is unrealistically high and will get better as more inodes are allocated in the filesystem.
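As a rough illustration of where that 497 comes from (approximate shell arithmetic using the numbers above; the 4KB-per-inode figure in the second calculation is only an assumed "realistic" MDT footprint for comparison, not a value measured on your system):

$ echo $((8 * 1024**4 / 17678817874))    # ~8TiB of space / reported inode estimate
497
$ echo $((8 * 1024**4 / 4096))           # same space at an assumed ~4KB per inode
2147483648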

Formatting under Lustre 2.10.8

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 --mgsnid="36.101.92.22 at tcp" --reformat mdt0000/mdt0000

This translates to only 948M inodes on the Lustre FS.

[root at elfsa1m1 ~]# df -i
Filesystem           Inodes  IUsed       IFree IUse% Mounted on
mdt0000         17678817874      6 17678817868    1% /mdt0000
mdt0000/mdt0000   948016092    263   948015829    1% /lfs/lfsarc01/mdt

[root at elfsa1m1 ~]# df -h
Filesystem       Size  Used Avail Use% Mounted on
mdt0000          8.3T  256K  8.3T   1% /mdt0000
mdt0000/mdt0000  8.2T   24M  8.2T   1% /lfs/lfsarc01/mdt

and there is no reasonable option to provide more file entries except for adding another MDT?

The Lustre statfs code factors in some initial estimates for the bytes-per-inode ratio when computing the total inode estimate for the filesystem.  When the filesystem is nearly empty, as is the case here, those initial estimates will dominate, but once you've allocated a few thousand inodes in the filesystem the actual values will dominate and you will have a much more accurate number for the total inode count.  This will probably be more in the range of 2B-4B inodes in the end, unless you also use Data-on-MDT (Lustre 2.11 and later) to store small files directly on the MDT.

You've also excluded the OST lines from the above output?  For a Lustre filesystem you (typically) also need at least one OST inode (object) for each file in the filesystem, possibly more than one, so "df" of the Lustre filesystem may also be limited by the number of inodes reported by the OSTs (which may themselves depend on the average bytes-per-inode of files stored on the OSTs).  If you use Data-on-MDT and only have small files, then no OST objects are needed for those files, but you consume correspondingly more space on the MDT.
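(As a side note, an easy way to see the per-target inode counts that feed into this is "lfs df -i" run from a client; the mount point below is just a placeholder:)

$ lfs df -i /mnt/lfsarc01     # inode totals and usage for the MDT(s) and each OST
$ lfs df -h /mnt/lfsarc01     # corresponding space usage per target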

Cheers, Andreas


From: Andreas Dilger <adilger at whamcloud.com>
Sent: Wednesday, October 02, 2019 18:49
To: Hebenstreit, Michael <michael.hebenstreit at intel.com>
Cc: Mohr Jr, Richard Frank <rmohr at utk.edu>; lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] changing inode size on MDT

There are several confusing/misleading comments on this thread that need to be clarified...

On Oct 2, 2019, at 13:45, Hebenstreit, Michael <michael.hebenstreit at intel.com> wrote:

http://wiki.lustre.org/Lustre_Tuning#Number_of_Inodes_for_MDS

Note that I've updated this page to reflect current defaults.  The Lustre Operations Manual has a much better description of these parameters.


and I'd like to use --mkfsoptions='-i 1024' to have more inodes in the MDT. We already ran out of inodes on that FS (probably due to a ZFS bug in an early IEEL version) - so I'd like to increase #inodes if possible.

The "-i 1024" option (bytes-per-inode ratio) is only needed for ldiskfs since it statically allocates the inodes at mkfs time, it is not relevant for ZFS since ZFS dynamically allocates inodes and blocks as needed.

On Oct 2, 2019, at 14:00, Colin Faber <cfaber at gmail.com> wrote:
With 1K inodes you won't have space to accommodate new features; IIRC the current minimum on modern Lustre is 2K now. If you're running out of MDT space you might consider DNE and multiple MDTs to accommodate that larger namespace.

To clarify, since Lustre 2.10 any new ldiskfs MDT will allocate 1024 bytes for the inode itself (-I 1024).  That allows enough space *within* the inode to efficiently store the xattrs for more complex layouts (PFL, FLR, DoM).  If the xattrs do not fit inside the inode itself then they will be stored in an external 4KB xattr block.
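(If you want to verify what an existing ldiskfs MDT was formatted with, something like the following should show it; the device path is a placeholder:)

dumpe2fs -h /dev/sdX 2>/dev/null | grep -E '^Inode (count|size)'
# prints the statically-allocated inode count and the per-inode size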

The MDT is formatted with a bytes-per-inode *ratio* of 2.5KB, which means (approximately) one inode will be created for every 2.5KB of the total MDT size.  That 2.5KB of space includes the 1KB for the inode itself, plus space for a directory entry (or multiple if hard-linked), extra xattrs, the journal (up to 4GB for large MDTs), Lustre recovery logs, ChangeLogs, etc.  Each directory inode will have at least one 4KB block allocated.

So, it is _possible_ to reduce the inode *ratio* below 2.5KB if you know what you are doing (e.g. 2KB/inode or 1.5KB/inode; this can be an arbitrary number of bytes, it doesn't have to be an even multiple of anything), but it definitely isn't possible to have a 1KB inode size and a 1KB per-inode ratio, as there wouldn't be *any* space left for directories, log files, the journal, etc.
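(Rough arithmetic, just to show how the ratio translates into an inode count; this uses your 8TB size purely for scale, assumes an ldiskfs MDT, and ignores space reserved for the journal and logs:)

$ echo $((8 * 1024**4 / 2560))    # default 2.5KB ratio => ~3.4B inodes
3435973836
$ echo $((8 * 1024**4 / 2048))    # -i 2048             => ~4.3B inodes
4294967296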

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud
