[lustre-discuss] size of MDT, inode count, inode size

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Tue Feb 6 09:27:51 PST 2018


Thanks for the great and informative answer!

On Mon, Feb 5, 2018 at 5:19 AM, Dilger, Andreas <andreas.dilger at intel.com>
wrote:

> On Feb 4, 2018, at 13:10, E.S. Rosenberg <esr+lustre at mail.hebrew.edu>
> wrote:
> > On Sat, Feb 3, 2018 at 4:45 AM, Dilger, Andreas <
> andreas.dilger at intel.com> wrote:
> >> On Jan 26, 2018, at 07:56, Thomas Roth <t.roth at gsi.de> wrote:
> >> >
> >> > Hmm, option-testing leads to more confusion:
> >> >
> >> > With this 922GB-sdb1 I do
> >> >
> >> > mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1
> >> >
> >> > The output of the command says
> >> >
> >> >   Permanent disk data:
> >> > Target:     test0:MDT0000
> >> > ...
> >> >
> >> > device size = 944137MB
> >> > formatting backing filesystem ldiskfs on /dev/sdb1
> >> >       target name   test0:MDT0000
> >> >       4k blocks     241699072
> >> >       options        -J size=4096 -I 1024 -i 2560 -q -O
> dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E
> lazy_journal_init -F
> >> >
> >> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024
> -i 2560 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg
> -E lazy_journal_init -F /dev/sdb1 241699072
> >>
> >> The default options have to be conservative, as we don't know in
> advance how a filesystem will be used.  It may be that some sites will have
> lots of hard links or long filenames (which consume directory space ==
> blocks, but not inodes), or they will have widely-striped files (which also
> consume xattr blocks).  The 2KB/inode ratio includes the space for the
> inode itself (512B in 2.7.x, 1024B in 2.10), at least one directory entry
> (~64 bytes), some fixed overhead for the journal (up to 4GB on the MDT),
> and Lustre-internal overhead (OI entry = ~64 bytes), ChangeLog, etc.
> >>
> >> If you have a better idea of space usage at your site, you can specify
> different parameters.
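> >>
> >> For example, something along these lines (the values here are purely
> >> illustrative; pick a ratio that matches your expected file count and
> >> layout sizes):
> >>
> >>   mkfs.lustre --mgs --mdt --fsname=test0 --index=0 \
> >>       --mkfsoptions="-i 2048 -I 1024" /dev/sdb1
> >>
> >> would create one inode per 2048 bytes of MDT space instead of the
> >> default ratio.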
> >>
> >> > Mounting this as ldiskfs gives 369 M inodes.
> >> >
> >> > One would assume that specifying one or more of the mke2fs options
> >> > here in the mkfs.lustre command would change nothing.
> >> >
> >> > However,
> >> >
> >> > mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024"
> /dev/sdb1
> >> >
> >> > says
> >> >
> >> > device size = 944137MB
> >> > formatting backing filesystem ldiskfs on /dev/sdb1
> >> >       target name   test0:MDT0000
> >> >       4k blocks     241699072
> >> >       options       -I 1024 -J size=4096 -i 1536 -q -O
> dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E
> lazy_journal_init -F
> >> >
> >> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000 -I 1024 -J size=4096 -i
> 1536 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg
> -E lazy_journal_init -F /dev/sdb1 241699072
> >> >
> >> > and the mounted device now has 615 M inodes.
> >> >
> >> > So, whatever makes the calculation for the "-i / bytes-per-inode"
> value becomes ineffective if I specify the inode size by hand?
> >>
> >> This is a bit surprising.  I agree that specifying the same inode size
> value as the default should not affect the calculation for the
> bytes-per-inode ratio.
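> >>
> >> If you want to compare what each test format actually produced without
> >> mounting as ldiskfs every time, something like the following (run against
> >> the unmounted device) should show the resulting values:
> >>
> >>   dumpe2fs -h /dev/sdb1 | grep -E 'Inode count|Inode size|Block count'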
> >>
> >> > How many bytes-per-inode do I need?
> >> >
> >> > This ratio, is it what the manual specifies as "one inode created for
> each 2kB of LUN" ?
> >>
> >> That was true with 512B inodes, but with the increase to 1024B inodes
> in 2.10 (to allow for PFL file layouts, since they are larger) the inode
> ratio has also gone up by 512B, to 2560B/inode.
> >
> > Does this mean that someone who updates their servers from 2.x to 2.10
> will not be able to use PFL since the MDT was formatted in a way that can't
> support it? (in our case, formatted under Lustre 2.5 and currently running 2.8)
>
> It will be possible to use PFL layouts with older MDTs, but there may be a
> performance impact if the MDTs are HDD based because a multi-component PFL
> layout is unlikely to fit into the 512-byte inode, so they will allocate an
> extra xattr block for each PFL file.  For SSD-based MDTs the extra seek is
> not likely to impact performance significantly, but for HDD-based MDTs this
> extra seek for accessing every file will reduce the metadata performance.
>
> If you formatted the MDT filesystem for a larger default stripe count (e.g.
> use "mkfs.lustre ... --stripe-count-hint=8" or more) then you will already
> have 1024-byte inodes, and this is a non-issue.
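>
> You can check what an existing ldiskfs MDT was formatted with via something
> like (the device path here is just a placeholder):
>
>   tune2fs -l /dev/mdt_device | grep 'Inode size'
>
> If that already reports 1024, the extra xattr block is normally not needed.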
>
> That said, the overall impact on your applications may be minimal if you do
> not have metadata-intensive workloads, and PFL can help improve the IO
> performance of applications, since many users do not set proper striping on
> their files.
>
> Of course, if you know in advance what the best striping for a file is, and
> your applications or users already use that, then PFL is not necessary and
> there is no performance impact if PFL is not used.
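>
> (For reference, composite PFL layouts are set with "lfs setstripe -E", e.g.
> something like
>
>   lfs setstripe -E 1G -c 1 -E -1 -c 4 /mnt/testfs/dir
>
> which stripes the first 1GB of each file over one OST and the rest over
> four; the component boundaries, counts, and path here are only placeholders
> and depend on your workload.)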
>
> Cheers, Andreas
>
> >> > Perhaps the raw size of an MDT device should rather be such that it
> >> > leads to "-I 1024 -i 2048"?
> >>
> >> Yes, that is probably reasonable, since the larger inode also means
> that there is less chance of external xattr blocks being allocated.
> >>
> >> Note that with ZFS there is no need to specify the inode ratio at all.
> It will dynamically allocate inode blocks as needed, along with directory
> blocks, OI tables, etc., until the filesystem is full.
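> >>
> >> A ZFS-backed MDT is created by naming a pool/dataset instead of a block
> >> device, along the lines of (pool and device names are placeholders):
> >>
> >>   mkfs.lustre --mgs --mdt --fsname=test0 --index=0 \
> >>       --backfstype=zfs test0-mdt0/mdt0 mirror /dev/sdb /dev/sdc
> >>
> >> with no inode-related mkfsoptions needed.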
> >>
> >> Cheers, Andreas
> >>
> >> > On 01/26/2018 03:10 PM, Thomas Roth wrote:
> >> >> Hi all,
> >> >> what is the relation between raw device size and size of a formatted
> MDT? Size of inodes + free space = raw size?
> >> >> The example:
> >> >> MDT device has 922 GB in /proc/partions.
> >> >> Formatted under Lustre 2.5.3 with default values for mkfs.lustre
> resulted in a 'df -h' MDT of 692G and more importantly 462M inodes.
> >> >> So, the space used for inodes + the 'df -h' output add up to the raw
> size:
> >> >>  462M inodes * 0.5kB/inode + 692 GB = 922 GB
> >> >> On that system there are now 330M files, more than 70% of the
> available inodes.
> >> >> 'df -h' says '692G  191G  456G  30% /srv/mds0'
> >> >> What do I need the remaining 450G for? (Or the ~400G left once all
> the inodes are eaten?)
> >> >> Should the format command not be tuned towards more inodes?
> >> >> Btw, on a Lustre 2.10.2 MDT I get 369M inodes and 550 G space (with
> a 922G raw device): inode size is now 1024.
> >> >> However, according to the manual and various Jira/Ludocs the size
> should be 2k nowadays?
> >> >> Actually, the command within mkfs.lustre reads
> >> >> mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024 -i 2560  -F
> /dev/sdb 241699072
> >> >> -i 2560 ?
> >> >> Cheers,
> >> >> Thomas
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation