[lustre-discuss] size of MDT, inode count, inode size

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Sun Feb 4 12:10:40 PST 2018


On Sat, Feb 3, 2018 at 4:45 AM, Dilger, Andreas <andreas.dilger at intel.com>
wrote:

> On Jan 26, 2018, at 07:56, Thomas Roth <t.roth at gsi.de> wrote:
> >
> > Hmm, option-testing leads to more confusion:
> >
> > With this 922GB /dev/sdb1 I do
> >
> > mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1
> >
> > The output of the command says
> >
> >   Permanent disk data:
> > Target:     test0:MDT0000
> > ...
> >
> > device size = 944137MB
> > formatting backing filesystem ldiskfs on /dev/sdb1
> >       target name   test0:MDT0000
> >       4k blocks     241699072
> >       options       -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
> >
> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000 -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb1 241699072
>
> The default options have to be conservative, as we don't know in advance
> how a filesystem will be used.  It may be that some sites will have lots of
> hard links or long filenames (which consume directory space == blocks, but
> not inodes), or they will have widely-striped files (which also consume
> xattr blocks).  The 2KB/inode ratio includes the space for the inode itself
> (512B in 2.7.x, 1024B in 2.10), at least one directory entry (~64 bytes),
> some fixed overhead for the journal (up to 4GB on the MDT), and
> Lustre-internal overhead (OI entry = ~64 bytes), ChangeLog, etc.
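>
> As a rough, illustrative tally of that budget: ~1024B for the inode
> itself (on 2.10), ~64B for a directory entry, ~64B for the OI entry,
> plus an amortized share of the journal and spare blocks for xattrs and
> directories, is how the default ratio lands in the 2-2.5KB/inode range.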
>
> If you have a better idea of space usage at your site, you can specify
> different parameters.
>
> > Mounting this as ldiskfs gives 369M inodes.
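> >
> > One way to read that count (the mount point here is just an example):
> >
> > mount -t ldiskfs /dev/sdb1 /mnt/mdt-test
> > df -i /mnt/mdt-test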
> >
> > One would assume that specifying one or more of these mke2fs options explicitly in the mkfs.lustre command would change nothing.
> >
> > However,
> >
> > mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024" /dev/sdb1
> >
> > says
> >
> > device size = 944137MB
> > formatting backing filesystem ldiskfs on /dev/sdb1
> >       target name   test0:MDT0000
> >       4k blocks     241699072
> >       options       -I 1024 -J size=4096 -i 1536 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
> >
> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000 -I 1024 -J size=4096 -i 1536 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb1 241699072
> >
> > and the mounted device now has 615M inodes.
> >
> > So, whatever calculates the "-i" (bytes-per-inode) value becomes ineffective if I specify the inode size by hand?
>
> This is a bit surprising.  I agree that specifying the same inode size
> value as the default should not affect the calculation for the
> bytes-per-inode ratio.
>
> > How many bytes-per-inode do I need?
> >
> > This ratio, is it what the manual specifies as "one inode created for each 2kB of LUN"?
>
> That was true with 512B inodes, but with the increase to 1024B inodes in
> 2.10 (to allow for PFL file layouts, which are larger), the inode ratio
> has also gone up by 512B, to 2560B/inode.
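>
> As a rough sanity check against the mkfs output above: 241699072 blocks
> x 4096B is about 990GB; at 2560B/inode that allows ~387M inodes, which
> journal and Lustre-internal overhead reduce to the ~369M observed. The
> same arithmetic at 1536B/inode gives ~645M, close to the 615M seen with
> "-I 1024".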
>
Does this mean that someone who updates their servers from 2.x to 2.10 will
not be able to use PFL since the MDT was formatted in a way that can't
support it? (In our case, the MDT was formatted under Lustre 2.5 and we
are currently running 2.8.)
Thanks,
Eli

>
> > Perhaps the raw size of an MDT device would be put to better use if the format led to "-I 1024 -i 2048"?
>
> Yes, that is probably reasonable, since the larger inode also means that
> there is less chance of external xattr blocks being allocated.
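>
> For example, something along these lines (a sketch; the device name and
> index are illustrative):
>
> mkfs.lustre --reformat --mgs --mdt --index=0 \
>     --mkfsoptions="-I 1024 -i 2048" /dev/sdb1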
>
> Note that with ZFS there is no need to specify the inode ratio at all.  It
> will dynamically allocate inode blocks as needed, along with directory
> blocks, OI tables, etc., until the filesystem is full.
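>
> For example (a sketch; pool, dataset, and device names are
> illustrative):
>
> mkfs.lustre --reformat --mgs --mdt --index=0 --backfstype=zfs \
>     mdt0pool/mdt0 /dev/sdb1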
>
> Cheers, Andreas
>
> > On 01/26/2018 03:10 PM, Thomas Roth wrote:
> >> Hi all,
> >> what is the relation between raw device size and the size of a formatted MDT? Size of inodes + free space = raw size?
> >> The example:
> >> MDT device has 922 GB in /proc/partitions.
> >> Formatted under Lustre 2.5.3 with default values for mkfs.lustre resulted in a 'df -h' MDT of 692G and, more importantly, 462M inodes.
> >> So, the space used for inodes + the 'df -h' output add up to the raw size:
> >>  462M inodes * 0.5kB/inode + 692 GB = 922 GB
> >> On that system there are now 330M files, more than 70% of the available inodes.
> >> 'df -h' says '692G  191G  456G  30% /srv/mds0'
> >> What do I need the remaining 450G for? (Or the ~400G left once all the inodes are eaten?)
> >> Should the format command not be tuned towards more inodes?
> >> Btw, on a Lustre 2.10.2 MDT I get 369M inodes and 550G of space (with a 922G raw device): inode size is now 1024.
> >> However, according to the manual and various Jira/LUDOC tickets, the size should be 2k nowadays?
> >> Actually, the command within mkfs.lustre reads
> >> mke2fs -j -b 4096 -L test0:MDT0000 -J size=4096 -I 1024 -i 2560 -F /dev/sdb 241699072
> >> -i 2560?
> >> Cheers,
> >> Thomas
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation

