<div dir="ltr"><div>Thanks for the great and informative answer!<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 5:19 AM, Dilger, Andreas <span dir="ltr"><<a href="mailto:andreas.dilger@intel.com" target="_blank">andreas.dilger@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Feb 4, 2018, at 13:10, E.S. Rosenberg <<a href="mailto:esr%2Blustre@mail.hebrew.edu">esr+lustre@mail.hebrew.edu</a>> wrote:<br>

> On Sat, Feb 3, 2018 at 4:45 AM, Dilger, Andreas <<a href="mailto:andreas.dilger@intel.com">andreas.dilger@intel.com</a>> wrote:<br>

>> On Jan 26, 2018, at 07:56, Thomas Roth <<a href="mailto:t.roth@gsi.de">t.roth@gsi.de</a>> wrote:<br>

>> ><br>

>> > Hmm, option-testing leads to more confusion:<br>

>> ><br>

>> > With this 922GB-sdb1 I do<br>

>> ><br>

>> > mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1<br>

>> ><br>

>> > The output of the command says<br>

>> ><br>

>> >   Permanent disk data:<br>

>> > Target:     test0:MDT0000<br>

>> > ...<br>

>> ><br>

>> > device size = 944137MB<br>

>> > formatting backing filesystem ldiskfs on /dev/sdb1<br>

>> >       target name   test0:MDT0000<br>

>> >       4k blocks     241699072<br>

>> >       options        -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,<wbr>mmp,dir_nlink,quota,huge_file,<wbr>flex_bg -E lazy_journal_init -F<br>

>> ><br>

>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,<wbr>mmp,dir_nlink,quota,huge_file,<wbr>flex_bg -E lazy_journal_init -F /dev/sdb1 241699072<br>

>><br>

>> The default options have to be conservative, as we don't know in advance how a filesystem will be used.  It may be that some sites will have lots of hard links or long filenames (which consume directory space == blocks, but not inodes), or they will have widely-striped files (which also consume xattr blocks).  The 2KB/inode ratio includes the space for the inode itself (512B in 2.7.x, 1024B in 2.10), at least one directory entry (~64 bytes), some fixed overhead for the journal (up to 4GB on the MDT), and Lustre-internal overhead (OI entry = ~64 bytes), ChangeLog, etc.<br>

>><br>

>> If you have a better idea of space usage at your site, you can specify different parameters.<br>

>><br>

>> > Mount this as ldiskfs, gives 369 M inodes.<br>

>> ><br>

>> > One would assume that specifying one / some of the mke2fs-options here in the mkfs.lustre-command will change nothing.<br>

>> ><br>

>> > However,<br>

>> ><br>

>> > mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024" /dev/sdb1<br>

>> ><br>

>> > says<br>

>> ><br>

>> > device size = 944137MB<br>

>> > formatting backing filesystem ldiskfs on /dev/sdb1<br>

>> >       target name   test0:MDT0000<br>

>> >       4k blocks     241699072<br>

>> >       options       -I 1024 -J size=4096 -i 1536 -q -O dirdata,uninit_bg,^extents,<wbr>mmp,dir_nlink,quota,huge_file,<wbr>flex_bg -E lazy_journal_init -F<br>

>> ><br>

>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT0000 -I 1024 -J size=4096 -i 1536 -q -O dirdata,uninit_bg,^extents,<wbr>mmp,dir_nlink,quota,huge_file,<wbr>flex_bg -E lazy_journal_init -F /dev/sdb1 241699072<br>

>> ><br>

>> > and the mounted device now has 615 M inodes.<br>

>> ><br>

>> > So, whatever makes the calculation for the "-i / bytes-per-inode" value becomes ineffective if I specify the inode size by hand?<br>

>><br>

>> This is a bit surprising.  I agree that specifying the same inode size value as the default should not affect the calculation for the bytes-per-inode ratio.<br>

>><br>

>> > How many bytes-per-inode do I need?<br>

>> ><br>

>> > This ratio, is it what the manual specifies as "one inode created for each 2kB of LUN" ?<br>

>><br>

>> That was true with 512B inodes, but with the increase to 1024B inodes in 2.10 (to allow for PFL file layouts, since they are larger) the inode ratio has also gone up by 512B, to 2560B/inode.<br>

><br>

> Does this mean that someone who updates their servers from 2.x to 2.10 will not be able to use PFL since the MDT was formatted in a way that can't support it? (in our case formatted under Lustre 2.5 currently running 2.8)<br>

<br>

</div></div>It will be possible to use PFL layouts with older MDTs, but there may be a<br>

performance impact if the MDTs are HDD based because a multi-component PFL<br>

layout is unlikely to fit into the 512-byte inode, so each PFL file will<br>

allocate an extra xattr block.  For SSD-based MDTs the extra seek is<br>

not likely to impact performance significantly, but for HDD-based MDTs this<br>

extra seek for accessing every file will reduce the metadata performance.<br>

<br>

If you formatted the MDT filesystem for a larger default stripe count (e.g.<br>

use "mkfs.lustre ... --stripe-count-hint=8" or more) then you will already<br>

have 1024-byte inodes, and this is a non-issue.<br>
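<br>

To see which case an existing ldiskfs MDT falls into, one option is to read the inode size from the superblock summary. The sketch below assumes the text comes from "dumpe2fs -h" run against the MDT device and that it contains an "Inode size:" line, as e2fsprogs normally prints; the function names are hypothetical helpers, not part of any Lustre tool.<br>

```python
import re


def parse_inode_size(dumpe2fs_header):
    """Return the inode size in bytes from `dumpe2fs -h` text, or None."""
    match = re.search(r"^Inode size:\s*(\d+)\s*$", dumpe2fs_header, re.MULTILINE)
    return int(match.group(1)) if match else None


def has_1k_inodes(inode_size):
    """True if the inode is at least 1024 bytes, the size discussed above."""
    return inode_size is not None and inode_size >= 1024
```

Feeding it the captured superblock text of the MDT device and passing the result to has_1k_inodes tells you whether the MDT already has the larger inodes; it does not say whether a particular PFL layout will actually fit.<br>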

<br>

That said, the overall impact to your applications may be minimal if you do<br>

not have metadata-intensive workloads, and PFL can improve the IO performance<br>

of applications, because many users do not set proper striping on their<br>

files.<br>

<br>

Of course, if you know in advance what the best striping for a file is, and<br>

your applications or users already use that, then PFL is not necessary and<br>

there is no performance impact if PFL is not used.<br>

<br>

Cheers, Andreas<br>

<span class="im HOEnZb"><br>

>> > Perhaps the raw size of an MDT device should better be such that it leads<br>

>> > to "-I 1024 -i 2048"?<br>

>><br>

>> Yes, that is probably reasonable, since the larger inode also means that there is less chance of external xattr blocks being allocated.<br>

>><br>

>> Note that with ZFS there is no need to specify the inode ratio at all.  It will dynamically allocate inode blocks as needed, along with directory blocks, OI tables, etc., until the filesystem is full.<br>

>><br>

>> Cheers, Andreas<br>

>><br>

>> > On 01/26/2018 03:10 PM, Thomas Roth wrote:<br>

>> >> Hi all,<br>

>> >> what is the relation between raw device size and size of a formatted MDT? Size of inodes + free space = raw size?<br>

>> >> The example:<br>

>> >> MDT device has 922 GB in /proc/partitions.<br>

>> >> Formatted under Lustre 2.5.3 with default values for mkfs.lustre resulted in a 'df -h' MDT of 692G and more importantly 462M inodes.<br>

>> >> So, the space used for inodes + the 'df -h' output add up to the raw size:<br>

>> >>  462M inodes * 0.5kB/inode + 692 GB = 922 GB<br>

>> >> On that system there are now 330M files, more than 70% of the available inodes.<br>

>> >> 'df -h' says '692G  191G  456G  30% /srv/mds0'<br>

>> >> What do I need the remaining 450G for? (Or the ~400G left once all the inodes are eaten?)<br>

>> >> Should the format command not be tuned towards more inodes?<br>

>> >> Btw, on a Lustre 2.10.2 MDT I get 369M inodes and 550 G space (with a 922G raw device): inode size is now 1024.<br>
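The space accounting in this message can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: "M" and "G" are treated as binary units (2^20 and 2^30), the helper name inode_table_gib is made up for illustration, and the journal, bitmaps and reserved blocks are ignored, so the remainder only approximates what 'df -h' shows.<br>

```python
GIB = 2**30
MIB = 2**20


def inode_table_gib(inode_count, inode_size):
    """Space taken by the inode tables alone, in GiB."""
    return inode_count * inode_size / GIB


if __name__ == "__main__":
    raw_gib = 922  # raw device size quoted in this thread
    # (label, inodes in Mi, inode size in bytes), all taken from the thread
    cases = (("2.5.3", 462, 512), ("2.10.2", 369, 1024))
    for label, inodes_mi, inode_size in cases:
        table = inode_table_gib(inodes_mi * MIB, inode_size)
        rest = raw_gib - table
        print(f"{label}: inode tables ~{table:.0f} GiB, remainder ~{rest:.0f} GiB")
```

Under those assumptions the inode tables take about 231 GiB in the 2.5.3 case and about 369 GiB in the 2.10.2 case, leaving roughly 691 GiB and 553 GiB, close to the 692G and 550G reported by 'df -h'.<br>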

>> >> However, according to the manual and various Jira/Ludocs the size should be 2k nowadays?<br>

>> >> Actually, the command within mkfs.lustre reads<br>

>> >> mke2fs -j -b 4096 -L test0:MDT0000  -J size=4096 -I 1024 -i 2560  -F /dev/sdb 241699072<br>

>> >> -i 2560 ?<br>

>> >> Cheers,<br>

>> >> Thomas<br>

<br>

</span><div class="HOEnZb"><div class="h5">Cheers, Andreas<br>

--<br>

Andreas Dilger<br>

Lustre Principal Architect<br>

Intel Corporation<br>

</div></div></blockquote></div><br></div>