<div dir="ltr"><div><div><div>Doesn't PFL also 'solve'/mitigate this issue in the sense that a file doesn't have to remain restricted to the OST(s) it started on?<br></div>(And as such balancing will even continue as files grow)<br></div>Regards,<br></div>Eli<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 16, 2018 at 9:57 PM, Dilger, Andreas <span dir="ltr"><<a href="mailto:andreas.dilger@intel.com" target="_blank">andreas.dilger@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mar 15, 2018, at 09:48, Steve Thompson <<a href="mailto:smt@vgersoft.com">smt@vgersoft.com</a>> wrote:<br>

><br>

> Lustre newbie here (1 month). Lustre 2.10.3, CentOS 7.4, ZFS 0.7.5. All networking is 10 GbE.<br>

><br>

> I am building a test Lustre filesystem. So far, I have two OSSs, each with 30 disks of 2 TB each, all in a single zpool per OSS. Everything works well, and was surprisingly easy to build. Thus, two OSTs of 60 TB each. The files are home directories. Clients number about 225 HPC systems (about 2400 cores).<br>

><br>

> In about a month, I will have a third OSS available, and about a month after that, a fourth. Each of these two systems has 48 disks of 4 TB each. I am looking for advice on how best to configure this. If I go with one OST per system (one zpool comprising 8 x 6 RAIDZ2 vdevs), I will have a Lustre filesystem composed of two 60 TB OSTs and two 192 TB OSTs (minus RAIDZ2 overhead). This is obviously a big mismatch between OST sizes. I have not encountered any discussion of the effect of mixing disparate OST sizes. I could instead format two 96 TB OSTs on each system (two zpools of 4 x 6 RAIDZ2 vdevs), or three 64 TB OSTs, and so on. More OSTs means more striping possibilities, but fewer vdevs per zpool hurts ZFS performance. More OSTs per OSS does not help with network bandwidth to the OSS. How would you go about this?<br>

<br>

</span>This is a little bit tricky.  Lustre itself can handle different OST sizes,<br>

as it will run in "QOS allocator" mode (essentially "Quantity of Space"; the<br>

full "Quality of Service" was not implemented).  This balances file allocation<br>

across OSTs based on percentage of free space, at the expense of performance<br>

being lower, as only the two new OSTs would be used for 192/252 ~= 75%<br>

of the files, since it isn't possible to *also* use all the OSTs evenly at the<br>

same time (assuming that network speed is your bottleneck, and not disk speed).<br>
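<br>
As a rough sketch of that arithmetic (a simplified model, not the actual allocator code):<br>
it assumes new files land on each OST in proportion to its free space, that all<br>
OSTs start empty, and it ignores RAIDZ2 overhead.  The OST names are made up.<br>

```python
# Capacities in TB, taken from the sizes discussed in this thread.
# The OST names are hypothetical; RAIDZ2 overhead is ignored.
ost_free_tb = {
    "OST0000": 60,   # existing OSS 1
    "OST0001": 60,   # existing OSS 2
    "OST0002": 192,  # new OSS 3
    "OST0003": 192,  # new OSS 4
}


def allocation_share(free_tb):
    """Return each OST's expected share of new files, weighted by free space.

    This is a simplification: it only models "more free space means
    proportionally more new files", nothing else the allocator may do.
    """
    total = sum(free_tb.values())
    return {name: free / total for name, free in free_tb.items()}


shares = allocation_share(ost_free_tb)
new_ost_share = shares["OST0002"] + shares["OST0003"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} of new files")
print(f"two large OSTs combined: {new_ost_share:.1%}")
```

The combined share for the two large OSTs is 384/504, the same ratio as the<br>
192/252 above.<br>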

<br>

For home directory usage this may not be a significant issue. This performance<br>

imbalance would balance out as the larger OSTs became more full, and would not<br>

be seen when files are striped across all OSTs.<br>

<br>

I also thought about creating 3x OSTs per new OSS, so they would all be about<br>

the same size and allocated equally.  That means the new OSS nodes would see<br>

about 3x as much IO traffic as the old ones, especially for files striped over<br>

all OSTs.  The drawback here is that the performance imbalance would stay<br>

forever, so in the long run I don't think this is as good as just having a<br>

single larger OST.  This will also become less of a factor as more OSTs are<br>

added to the filesystem and/or you eventually upgrade the initial OSTs to<br>

have larger disks and/or more VDEVs.<br>
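<br>
To put numbers on the two layouts, here is a minimal sketch (my assumption: a file<br>
striped over all OSTs sends an equal amount of IO to each OST, so an OSS sees<br>
traffic in proportion to the number of OSTs it hosts).  The OSS names are made up.<br>

```python
def oss_traffic_share(osts_per_oss):
    """Return each OSS's share of IO for a file striped evenly over all OSTs.

    Assumes every OST receives the same amount of IO, so an OSS's share is
    simply its OST count divided by the total OST count.
    """
    total_osts = sum(osts_per_oss.values())
    return {oss: count / total_osts for oss, count in osts_per_oss.items()}


# Layout A: one large OST on each new OSS (hypothetical names oss3, oss4).
single_large = oss_traffic_share({"oss1": 1, "oss2": 1, "oss3": 1, "oss4": 1})

# Layout B: three similar-sized OSTs on each new OSS.
three_small = oss_traffic_share({"oss1": 1, "oss2": 1, "oss3": 3, "oss4": 3})

# How much more traffic a new OSS sees than an old one in layout B.
ratio = three_small["oss3"] / three_small["oss1"]

print("one OST per new OSS:   ", single_large)
print("three OSTs per new OSS:", three_small)
print("new/old OSS traffic ratio with three OSTs:", ratio)
```

With one OST per OSS every server carries an equal share of a fully striped file;<br>
with three OSTs per new OSS, the new servers each carry three times the share of<br>
an old one, and that ratio does not change as the OSTs fill.<br>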

<br>

<br>

Cheers, Andreas<br>

--<br>

Andreas Dilger<br>

Lustre Principal Architect<br>

Intel Corporation<br>

<div class="HOEnZb"><div class="h5"><br>



</div></div></blockquote></div><br></div>