[lustre-discuss] Poor(?) Lustre performance
Finn Rawles Malliagh
up883044 at myport.ac.uk
Wed Apr 20 11:03:01 PDT 2022
Hi Andreas,
Thank you for taking the time to reply with such a detailed response.
I have taken your advice on board and made some changes. Firstly, I have
swapped from ZFS to striped LVM groups (including the P4800X, rather than
using it as a cache drive). I have also modified io500.sh to include the
optimisations you suggested. Rerunning the IO500 benchmark gives the
metadata results below:
With ZFS:
[RESULT] mdtest-easy-write 0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT] mdtest-hard-write 0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT] find 25.311534 kIOPS : time 1.631 seconds
[RESULT] mdtest-easy-stat 0.570021 kIOPS : time 50.067 seconds
[RESULT] mdtest-hard-stat 1.834985 kIOPS : time 7.998 seconds
[RESULT] mdtest-easy-delete 1.715750 kIOPS : time 17.308 seconds
[RESULT] mdtest-hard-read 1.006240 kIOPS : time 13.759 seconds
[RESULT] mdtest-hard-delete 1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]
With LVM:
[RESULT] mdtest-easy-write 3.057249 kIOPS : time 27.177 seconds [INVALID]
[RESULT] mdtest-hard-write 1.576865 kIOPS : time 51.740 seconds [INVALID]
[RESULT] find 71.979457 kIOPS : time 2.234 seconds
[RESULT] mdtest-easy-stat 1.841655 kIOPS : time 44.443 seconds
[RESULT] mdtest-hard-stat 1.779211 kIOPS : time 45.967 seconds
[RESULT] mdtest-easy-delete 1.559825 kIOPS : time 52.301 seconds
[RESULT] mdtest-hard-read 0.631109 kIOPS : time 127.765 seconds
[RESULT] mdtest-hard-delete 0.856858 kIOPS : time 94.372 seconds
[SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]
I believe these scores are more in line with what I should expect. However,
my throughput performance still seems to be lacking(?). In your expert
opinion, is this just a matter of tuning the IO500/LVM parameters further,
or is there something more fundamentally wrong with the configuration of
this Lustre cluster?
With LVM:
[RESULT] ior-easy-write 2.127026 GiB/s : time 122.305 seconds [INVALID]
[RESULT] ior-hard-write 1.408638 GiB/s : time 1.246 seconds [INVALID]
[RESULT] ior-easy-read 1.549550 GiB/s : time 167.881 seconds
[RESULT] ior-hard-read 0.174036 GiB/s : time 10.063 seconds
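
For reference, the IO500 bandwidth score is the geometric mean of the four
ior results, and TOTAL is the geometric mean of the bandwidth and IOPS
scores. The following is only a minimal sketch of that arithmetic: the
`geomean` helper is a hypothetical name and not part of io500, and the
inputs are the LVM figures reported above.

```shell
#!/usr/bin/env bash
# geomean: geometric mean of the numbers passed as arguments.
# Hypothetical helper for illustration; awk does the floating-point math.
geomean() {
    printf '%s\n' "$@" |
        awk '{ s += log($1); n++ } END { printf "%.6f\n", exp(s / n) }'
}

# ior results from the LVM run (GiB/s):
# easy-write, hard-write, easy-read, hard-read
bw=$(geomean 2.127026 1.408638 1.549550 0.174036)

# TOTAL combines the bandwidth score with the reported IOPS score (kIOPS).
total=$(geomean "$bw" 2.359024)

echo "bandwidth=$bw total=$total"
```

The output should land close to the reported 0.948100 GiB/s and 1.495524.
Because the mean is geometric, the single 0.174036 GiB/s ior-hard-read
result pulls the whole bandwidth score down sharply.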
Kind Regards,
Finn
On Wed, 20 Apr 2022 at 09:24, Andreas Dilger <adilger at whamcloud.com> wrote:
> On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
>
>
> Hi all,
>
> I have just set up a three-node Lustre configuration, and initial testing
> shows what I think are slow results. The current configuration is 2 OSS, 1
> MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100GbE
> eth, 2x 6252, 380GB DRAM.
> I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is
> enabled).
> All zpools are set up identically for OST1, OST2, and MDT1:
>
> [root at stor3 ~]# zpool status
> pool: osstank
> state: ONLINE
> scan: none requested
> config:
> NAME STATE READ WRITE CKSUM
> osstank ONLINE 0 0 0
> nvme1n1 ONLINE 0 0 0
> nvme2n1 ONLINE 0 0 0
> nvme3n1 ONLINE 0 0 0
> cache
> nvme0n1 ONLINE 0 0 0
>
>
> It's been a while since I've done anything with ZFS, but I see a few
> potential issues here:
> - firstly, it doesn't make sense IMHO to have an NVMe cache device when
>   the main storage pool is also NVMe. You could better use that
>   capacity/bandwidth for storing more data instead of duplicating it
>   into the cache device. Also, Lustre cannot use the ZIL.
> - in general ZFS is not very good at IOPS workloads because of the high
>   overhead per block. Lustre can't use the ZIL, so there is no
>   opportunity to accelerate heavy IOPS workloads.
>
> When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get
> these performance numbers:
> IO500 version io500-isc22_v1 (standard)
> [RESULT] ior-easy-write 1.173435 GiB/s : time 31.703 seconds [INVALID]
> [RESULT] ior-hard-write 0.821624 GiB/s : time 1.070 seconds [INVALID]
> [RESULT] ior-easy-read 5.177930 GiB/s : time 7.187 seconds
> [RESULT] ior-hard-read 5.331791 GiB/s : time 0.167 seconds
>
>
> When running "./io500 ./config-minimalLOCAL.ini" on a single locally
> mounted ZFS pool I get the following performance numbers:
> IO500 version io500-isc22_v1 (standard)
> [RESULT] ior-easy-write 1.304500 GiB/s : time 33.302 seconds [INVALID]
> [RESULT] ior-hard-write 0.485283 GiB/s : time 1.806 seconds [INVALID]
> [RESULT] ior-easy-read 3.078668 GiB/s : time 14.111 seconds
> [RESULT] ior-hard-read 3.183521 GiB/s : time 0.275 seconds
>
>
> There are definitely some file layout tunables that can improve IO500
> performance for these workloads.
> See the stock io500.sh file, where they are commented out by default:
>
> # Example commands to create output directories for Lustre. Creating
> # top-level directories is allowed, but not the whole directory tree.
> #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
> # lfs setdirstripe -D -c -1 $workdir
> #fi
> #lfs setstripe -c 1 $workdir
> #mkdir $workdir/ior-easy $workdir/ior-hard
> #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
> #local osts=$(lfs df $workdir | grep -c OST)
> # Try overstriping for ior-hard to improve scaling, or use wide striping
> #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
> # lfs setstripe -c -1 $workdir/ior-hard
> # Try to use DoM if available, otherwise use default for small files
> #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
> #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
> #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
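
What those commands amount to once uncommented can be previewed with a
dry-run wrapper. This is a sketch under stated assumptions, not the io500.sh
implementation: the `io500_layout` function, the `DRY_RUN` switch, the
`/mnt/lustre/io500` path, and passing the OST count by hand (instead of
parsing `lfs df`) are all hypothetical. The OST count of 2 matches the OST1
and OST2 mentioned in the thread.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the layout commands from the io500.sh excerpt.
# Hypothetical names: io500_layout, DRY_RUN, and the example path below.
io500_layout() {
    local workdir=$1 osts=$2
    local run="echo"                      # default: only print the commands
    [ "${DRY_RUN:-1}" = "0" ] && run=""   # DRY_RUN=0 executes them for real

    $run lfs setstripe -c 1 "$workdir"
    $run mkdir "$workdir/ior-easy" "$workdir/ior-hard"
    $run mkdir "$workdir/mdtest-easy" "$workdir/mdtest-hard"
    # Overstripe ior-hard (4 stripes per OST), else fall back to wide striping.
    $run lfs setstripe -C $((osts * 4)) "$workdir/ior-hard" ||
        $run lfs setstripe -c -1 "$workdir/ior-hard"
    # Data-on-MDT for the first 64k of each file, if the server supports it.
    $run lfs setstripe -E 64k -L mdt "$workdir/mdtest-easy" || true
    $run lfs setstripe -E 64k -L mdt "$workdir/mdtest-hard" || true
}

# Print the commands for a two-OST filesystem without touching anything.
io500_layout /mnt/lustre/io500 2
```

In dry-run mode only the first alternative of each `||` pair is printed,
since `echo` always succeeds. As far as I know, overstriping (`-C`) needs a
newer Lustre release than 2.12, so on 2.12.8 a real run would likely take
the `-c -1` fallback.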
>
>
> As you can see above, the IO performance of Lustre isn't really much
> different than the local storage performance of ZFS. You are always going
> to lose some percentage over the network and because of the added
> distributed locking. That said, for the hardware that you have, it should
> be getting about 2-3GB/s per NVMe device, and up to 10GB/s over the
> network, so the limitation here is really ZFS. It would be useful to test
> with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.
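
The ceiling implied by those figures can be worked out with integer
arithmetic. This is a back-of-envelope sketch only. It assumes three data
NVMe devices per OSS (as in the zpool status output) and takes the 2-3GB/s
per device and 10GB/s network estimates at face value; the variable names
and the `min` helper are illustrative.

```shell
#!/usr/bin/env bash
# Rough per-OSS throughput ceiling: sum of device bandwidth, capped by the network.
nvme_per_oss=3     # data devices per pool, per the zpool status output (assumption)
per_nvme_low=2     # GB/s per device, low end of the estimate
per_nvme_high=3    # GB/s per device, high end of the estimate
net_gbs=10         # GB/s over the network, per the estimate

# min: print the smaller of two integers.
min() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

low=$(min $((nvme_per_oss * per_nvme_low)) "$net_gbs")
high=$(min $((nvme_per_oss * per_nvme_high)) "$net_gbs")
echo "per-OSS ceiling: ${low}-${high} GB/s"
```

With these inputs the sketch prints a ceiling of 6-9 GB/s per OSS, several
times the 1.17 GiB/s ior-easy-write result measured through Lustre.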
>
> When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get
> these performance numbers:
> IO500 version io500-isc22_v1 (standard)
>
> [RESULT] mdtest-easy-write 0.931693 kIOPS : time 31.028 seconds [INVALID]
> [RESULT] mdtest-hard-write 0.427000 kIOPS : time 31.070 seconds [INVALID]
> [RESULT] find 25.311534 kIOPS : time 1.631 seconds
> [RESULT] mdtest-easy-stat 0.570021 kIOPS : time 50.067 seconds
> [RESULT] mdtest-hard-stat 1.834985 kIOPS : time 7.998 seconds
> [RESULT] mdtest-easy-delete 1.715750 kIOPS : time 17.308 seconds
> [RESULT] mdtest-hard-read 1.006240 kIOPS : time 13.759 seconds
> [RESULT] mdtest-hard-delete 1.624117 kIOPS : time 8.910 seconds
> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]
>
> When running "./io500 ./config-minimalLOCAL.ini" on a single locally
> mounted ZFS pool I get the following performance numbers:
> IO500 version io500-isc22_v1 (standard)
>
> [RESULT] mdtest-easy-write 47.979181 kIOPS : time 1.838 seconds [INVALID]
> [RESULT] mdtest-hard-write 27.801814 kIOPS : time 2.443 seconds [INVALID]
> [RESULT] find 1384.774433 kIOPS : time 0.074 seconds
> [RESULT] mdtest-easy-stat 343.232733 kIOPS : time 1.118 seconds
> [RESULT] mdtest-hard-stat 333.241620 kIOPS : time 1.123 seconds
> [RESULT] mdtest-easy-delete 45.723381 kIOPS : time 1.884 seconds
> [RESULT] mdtest-hard-read 73.637312 kIOPS : time 1.546 seconds
> [RESULT] mdtest-hard-delete 42.191867 kIOPS : time 1.956 seconds
> [SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]
>
>
> Definitely the metadata performance is lower here, because each Lustre
> file has to create (at least) two objects (one on MDT, one or more on
> OST(s)) and then write and access them again. Lustre metadata performance
> would definitely benefit from enabling PFL and Data-on-MDT (per above
> default commands), since it only needs to do the MDT create/access.
>
> I have run an iperf3 test and I was able to reach speeds of around 40Gb/s,
> so I don't think the network links are the issue (maybe it's something to
> do with LNet?).
>
> Could anyone more knowledgeable than me please explain why the local
> three-disk ZFS pool outperforms the Lustre FS?
> I'm very new to this kind of benchmarking, so it may also be that I am
> misinterpreting the results or not applying the test correctly.
>
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud