<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss &lt;<a href="mailto:lustre-discuss@lists.lustre.org" class="">lustre-discuss@lists.lustre.org</a>&gt; wrote:<br class="">
<div>
<blockquote type="cite" class=""><br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">Hi all,
<div class=""><br class="">
</div>
<div class="">I have just set up a three-node Lustre configuration, and initial testing shows what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe eth, 2x 6252, 380GB
dram</div>
<div class="">I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is enabled)</div>
<div class="">All zpools are setup identical for OST1, OST2, and MDT1</div>
<div class=""><br class="">
</div>
<div class="">[root@stor3 ~]# zpool status<br class="">
pool: osstank<br class="">
state: ONLINE<br class="">
scan: none requested<br class="">
config:<br class="">
NAME STATE READ WRITE CKSUM<br class="">
osstank ONLINE 0 0 0<br class="">
nvme1n1 ONLINE 0 0 0<br class="">
nvme2n1 ONLINE 0 0 0<br class="">
nvme3n1 ONLINE 0 0 0<br class="">
cache<br class="">
nvme0n1 ONLINE 0 0 0<br class="">
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
It's been a while since I've done anything with ZFS, but I see a few potential issues here:</div>
<div>- Firstly, IMHO it doesn't make sense to have an NVMe cache device when the main storage</div>
<div> pool is also NVMe. That capacity/bandwidth would be better used for storing more data</div>
<div> instead of duplicating data into the cache device.</div>
<div>- In general, ZFS is not very good at IOPS workloads because of the high overhead per block.</div>
<div> Lustre also cannot use the ZIL, so there is no opportunity to accelerate heavy IOPS workloads that way.</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" class="">
<div class="">When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:</div>
<div class="">IO500 version io500-isc22_v1 (standard)<br class="">
[RESULT] ior-easy-write 1.173435 GiB/s : time 31.703 seconds [INVALID]<br class="">
[RESULT] ior-hard-write 0.821624 GiB/s : time 1.070 seconds [INVALID]</div>
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">[RESULT] ior-easy-read 5.177930 GiB/s : time 7.187 seconds<br class="">
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">[RESULT] ior-hard-read 5.331791 GiB/s : time 0.167 seconds<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted ZFS pool I get the following performance numbers:<br class="">
</div>
<div class="">IO500 version io500-isc22_v1 (standard)<br class="">
[RESULT] ior-easy-write 1.304500 GiB/s : time 33.302 seconds [INVALID]<br class="">
</div>
</div>
</blockquote>
<div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">[RESULT] ior-hard-write 0.485283 GiB/s : time 1.806 seconds [INVALID]<br class="">
</div>
</div>
</blockquote>
</div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">[RESULT] ior-easy-read 3.078668 GiB/s : time 14.111 seconds<br class="">
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">[RESULT] ior-hard-read 3.183521 GiB/s : time 0.275 seconds</div>
</div>
</blockquote>
<br class="">
</div>
<div>There are definitely some file layout tunables that can improve IO500 performance for these workloads.</div>
<div>See the stock io500.sh file, where they are commented out by default:</div>
<div><br class="">
</div>
<div> # Example commands to create output directories for Lustre. Creating<br class="">
# top-level directories is allowed, but not the whole directory tree.<br class="">
#if (( $(lfs df $workdir | grep -c MDT) > 1 )); then<br class="">
# lfs setdirstripe -D -c -1 $workdir<br class="">
#fi<br class="">
#lfs setstripe -c 1 $workdir<br class="">
#mkdir $workdir/ior-easy $workdir/ior-hard<br class="">
#mkdir $workdir/mdtest-easy $workdir/mdtest-hard<br class="">
#local osts=$(lfs df $workdir | grep -c OST)<br class="">
# Try overstriping for ior-hard to improve scaling, or use wide striping<br class="">
#lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||<br class="">
# lfs setstripe -c -1 $workdir/ior-hard<br class="">
# Try to use DoM if available, otherwise use default for small files<br class="">
#lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?<br class="">
#lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?<br class="">
#lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd<br class="">
<br class="">
</div>
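<div><br class="">
</div>
<div>For reference, the commented-out commands above could be turned into something like the following. This is only a minimal sketch, not a tested recipe: the helper names (run, setup_io500_dirs), the DRY_RUN/WORKDIR/OSTS variables and the /mnt/lustre/io500 path are made up for illustration. By default it only prints the lfs/mkdir commands so they can be reviewed first. Overstriping (-C) is not available on every Lustre release, which is why the wide-striping fallback is kept.</div>
<pre>
```shell
#!/bin/bash
# Sketch: apply the Lustre layout hints from io500.sh to a work directory.
# DRY_RUN=1 (the default here) only prints the commands, so nothing changes
# until the output has been reviewed. WORKDIR and OSTS are illustrative values.

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_io500_dirs() {
    local workdir=$1 osts=$2

    # Default to one stripe per file, then create the top-level test dirs.
    run lfs setstripe -c 1 "$workdir"
    run mkdir -p "$workdir/ior-easy" "$workdir/ior-hard"
    run mkdir -p "$workdir/mdtest-easy" "$workdir/mdtest-hard"

    # Try overstriping ior-hard (4 stripes per OST); if that fails,
    # fall back to striping across all OSTs.
    run lfs setstripe -C $((osts * 4)) "$workdir/ior-hard" ||
        run lfs setstripe -c -1 "$workdir/ior-hard"

    # Data-on-MDT for the first 64 KiB of each mdtest file, if supported.
    run lfs setstripe -E 64k -L mdt "$workdir/mdtest-easy" || true
    run lfs setstripe -E 64k -L mdt "$workdir/mdtest-hard" || true
}

setup_io500_dirs "${WORKDIR:-/mnt/lustre/io500}" "${OSTS:-2}"
```
</pre>
<div>Running it with DRY_RUN=0 on a client would actually apply the layouts; the directories need to be set up before io500 creates any files in them, since the layout only applies to newly created files.</div>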
<div><br class="">
</div>
<div>As you can see above, the IO performance of Lustre isn't really much different from the local storage</div>
<div>performance of ZFS. You are always going to lose some percentage over the network and because</div>
<div>of the added distributed locking. That said, the hardware that you have should be getting about</div>
<div>2-3GB/s per NVMe device, and up to 10GB/s over the network, so the limitation here is really ZFS.</div>
<div>It would be useful to test with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.</div>
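<div><br class="">
</div>
<div>To put a rough number on that gap, here is a back-of-the-envelope calculation using only figures from this thread: the local ZFS ior-easy-write result (1.3045 GiB/s), the three data NVMe devices in the pool, and the 2-3GB/s per-device estimate. The pct_of_expected helper is a made-up name, and the estimate assumes the devices scale linearly, which is optimistic.</div>
<pre>
```shell
#!/bin/sh
# Rough arithmetic only; all inputs are numbers quoted in this thread.
pct_of_expected() {
    # $1 = measured GiB/s, $2 = number of NVMe devices,
    # $3 = assumed GB/s per device. 1 GiB = 1.073741824 GB.
    awk -v m="$1" -v n="$2" -v d="$3" \
        'BEGIN { printf "%.0f\n", 100 * (m * 1.073741824) / (n * d) }'
}

# Local ZFS ior-easy-write as a percentage of 3 devices at 2 and 3 GB/s each.
pct_of_expected 1.304500 3 2
pct_of_expected 1.304500 3 3
```
</pre>
<div>This prints 23 and 16, i.e. the local pool is writing at roughly a fifth of what the raw devices should manage under those assumptions.</div>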
<div><br class="">
</div>
<div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:</div>
<div class="">IO500 version io500-isc22_v1 (standard)<br class="">
</div>
</div>
</blockquote>
</div>
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" class="">
<div class="">[RESULT] mdtest-easy-write 0.931693 kIOPS : time 31.028 seconds [INVALID]</div>
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" class="">
<div class="">[RESULT] mdtest-hard-write 0.427000 kIOPS : time 31.070 seconds [INVALID]<br class="">
[RESULT] find 25.311534 kIOPS : time 1.631 seconds<br class="">
[RESULT] mdtest-easy-stat 0.570021 kIOPS : time 50.067 seconds<br class="">
[RESULT] mdtest-hard-stat 1.834985 kIOPS : time 7.998 seconds<br class="">
[RESULT] mdtest-easy-delete 1.715750 kIOPS : time 17.308 seconds<br class="">
[RESULT] mdtest-hard-read 1.006240 kIOPS : time 13.759 seconds<br class="">
[RESULT] mdtest-hard-delete 1.624117 kIOPS : time 8.910 seconds<br class="">
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted ZFS pool I get the following performance numbers:<br class="">
</div>
<div class="">IO500 version io500-isc22_v1 (standard)</div>
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" class="">
<div class="">[RESULT] mdtest-easy-write 47.979181 kIOPS : time 1.838 seconds [INVALID]<br class="">
[RESULT] mdtest-hard-write 27.801814 kIOPS : time 2.443 seconds [INVALID]<br class="">
[RESULT] find 1384.774433 kIOPS : time 0.074 seconds<br class="">
[RESULT] mdtest-easy-stat 343.232733 kIOPS : time 1.118 seconds<br class="">
[RESULT] mdtest-hard-stat 333.241620 kIOPS : time 1.123 seconds<br class="">
[RESULT] mdtest-easy-delete 45.723381 kIOPS : time 1.884 seconds<br class="">
[RESULT] mdtest-hard-read 73.637312 kIOPS : time 1.546 seconds<br class="">
[RESULT] mdtest-hard-delete 42.191867 kIOPS : time 1.956 seconds<br class="">
[SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]<br class="">
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
The metadata performance is definitely lower here, because each Lustre file has to create (at least)</div>
<div>two objects (one on the MDT, one or more on the OST(s)) and then write and access them again.</div>
<div>Lustre metadata performance would definitely benefit from enabling PFL and Data-on-MDT (per the</div>
<div>default commands above), since small files then only need the MDT create/access.</div>
<div><br class="">
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">I have run an iperf3 test and I was able to reach speeds of around 40GbE so I don't think the network links are the issue (Maybe it's something to do with lnet?)</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">If anyone more knowledgeable than me would please educate me on why the performance of the local three disk ZFS is more performant than the lustre FS.</div>
<div class="">I'm very new to this kind of benchmarking so it may also be that I am misinterpreting the results/ not applying the test correctly.</div>
</div>
</div>
</blockquote>
</div>
<div><br class="">
</div>
<br class="">
<div class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div>Cheers, Andreas</div>
<div>--</div>
<div>Andreas Dilger</div>
<div>Lustre Principal Architect</div>
<div>Whamcloud</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div><br class="">
</div>
</div>
</div>
</div>
</div>
</div>
<br class="Apple-interchange-newline">
</div>
<br class="Apple-interchange-newline">
<br class="Apple-interchange-newline">
</div>
<br class="">
</body>
</html>