[lustre-discuss] Poor(?) Lustre performance

Wed Apr 20 01:24:20 PDT 2022

On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

Hi all,

I have just set up a three-node Lustre configuration, and initial testing shows what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100GbE Ethernet, 2x 6252, 380GB DRAM.
I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is enabled).
All zpools are set up identically for OST1, OST2, and MDT1.

[root at stor3 ~]# zpool status
  pool: osstank
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        osstank     ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
        cache
          nvme0n1   ONLINE       0     0     0

It's been a while since I've done anything with ZFS, but I see a few potential issues here:
- firstly, it doesn't make sense IMHO to have an NVMe cache device when the main storage
  pool is also NVMe.  You could better use that capacity/bandwidth for storing more data
  instead of duplicating it into the cache device.
- in general ZFS is not very good at IOPS workloads because of the high overhead per block.
  Lustre also cannot use the ZIL, so there is no opportunity to accelerate heavy IOPS
  workloads that way.
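
As a rough sketch of the first point, assuming the pool and device names shown in
the zpool status output above (osstank, nvme0n1), the cache device could be dropped
and re-added as a regular data vdev.  The run helper and the DRY_RUN switch are
hypothetical conveniences, not part of ZFS; by default the script only prints the
commands, since adding a top-level vdev is not easily undone.

```shell
#!/bin/sh
# Sketch only: pool and device names are taken from the zpool status above.
POOL=${POOL:-osstank}
CACHE_DEV=${CACHE_DEV:-nvme0n1}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command, and execute it only if DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repurpose_cache() {
    # Detach the cache (L2ARC) device from the pool.
    run zpool remove "$POOL" "$CACHE_DEV"
    # Re-add the same device as an additional striped data vdev.
    run zpool add "$POOL" "$CACHE_DEV"
    # Show the resulting layout.
    run zpool status "$POOL"
}
```

Calling repurpose_cache prints the three commands; setting DRY_RUN=0 would run
them for real, which should only be done on a pool that holds no data you care about.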

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.173435 GiB/s : time 31.703 seconds [INVALID]
[RESULT]       ior-hard-write        0.821624 GiB/s : time 1.070 seconds [INVALID]
[RESULT]        ior-easy-read        5.177930 GiB/s : time 7.187 seconds
[RESULT]        ior-hard-read        5.331791 GiB/s : time 0.167 seconds

When running "./io500 ./config-minimalLOCAL.ini" on a single locally mounted ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.304500 GiB/s : time 33.302 seconds [INVALID]
[RESULT]       ior-hard-write        0.485283 GiB/s : time 1.806 seconds [INVALID]
[RESULT]        ior-easy-read        3.078668 GiB/s : time 14.111 seconds
[RESULT]        ior-hard-read        3.183521 GiB/s : time 0.275 seconds

There are definitely some file layout tunables that can improve IO500 performance for these workloads.
See the default io500.sh file, where they are commented out by default:

  # Example commands to create output directories for Lustre.  Creating
  # top-level directories is allowed, but not the whole directory tree.
  #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
  #  lfs setdirstripe -D -c -1 $workdir
  #fi
  #lfs setstripe -c 1 $workdir
  #mkdir $workdir/ior-easy $workdir/ior-hard
  #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
  #local osts=$(lfs df $workdir | grep -c OST)
  # Try overstriping for ior-hard to improve scaling, or use wide striping
  #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
  #  lfs setstripe -c -1 $workdir/ior-hard
  # Try to use DoM if available, otherwise use default for small files
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
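
A minimal sketch of what enabling those hints might look like, using only the lfs
commands from the commented-out block above.  The function name setup_io500_dirs
and the LFS override variable are hypothetical; the DNE directory striping step is
left out because this setup has a single MDT.

```shell
#!/bin/sh
# Sketch: apply the Lustre layout hints that io500.sh ships commented out.
# LFS is a hypothetical override so the lfs binary can be swapped for a stub.
LFS=${LFS:-lfs}

setup_io500_dirs() {
    workdir=$1
    # Number of OSTs as reported by "lfs df" (same grep as in io500.sh).
    osts=$($LFS df "$workdir" | grep -c OST)
    # Default layout: one stripe per file.
    $LFS setstripe -c 1 "$workdir"
    mkdir -p "$workdir/ior-easy" "$workdir/ior-hard" \
             "$workdir/mdtest-easy" "$workdir/mdtest-hard"
    # Overstripe ior-hard (4 stripes per OST), else fall back to all OSTs.
    $LFS setstripe -C $((osts * 4)) "$workdir/ior-hard" ||
        $LFS setstripe -c -1 "$workdir/ior-hard"
    # Data-on-MDT for the first 64k of each file, if the server supports it.
    $LFS setstripe -E 64k -L mdt "$workdir/mdtest-easy" || true
    $LFS setstripe -E 64k -L mdt "$workdir/mdtest-hard" || true
}
```

It would be called with the IO500 working directory on the Lustre mount, for
example "setup_io500_dirs /mnt/lustre/io500" (a made-up path), before starting the run.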

As you can see above, the IO performance of Lustre isn't really much different from the local storage
performance of ZFS.  You are always going to lose some percentage over the network and because
of the added distributed locking.  That said, the hardware that you have should be getting about
2-3GB/s per NVMe device, and up to 10GB/s over the network, so the limitation here is really ZFS.
It would be useful to test with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.
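
To put rough numbers on that, here is a back-of-the-envelope sketch.  The inputs
are the estimates from this thread (3 data NVMes per OSS, the low end of 2GB/s per
device, 10GB/s network), not measurements, and the function names are made up for
illustration.

```shell
#!/bin/sh
# Rough ceiling: min(aggregate NVMe bandwidth, network bandwidth), in GB/s.
expected_ceiling() {
    # $1 = OSS count, $2 = data NVMes per OSS, $3 = GB/s per NVMe, $4 = network GB/s
    awk -v n="$1" -v d="$2" -v b="$3" -v net="$4" 'BEGIN {
        disk = n * d * b
        printf "%.1f\n", (disk < net ? disk : net)
    }'
}

# Fraction of the ceiling actually achieved, as a whole percentage.
percent_of() {
    awk -v got="$1" -v max="$2" 'BEGIN { printf "%.0f\n", 100 * got / max }'
}

ceiling=$(expected_ceiling 2 3 2 10)
echo "expected ceiling: ${ceiling} GB/s"
# ior-easy-write was 1.173435 GiB/s, which is about 1.26 GB/s.
echo "ior-easy-write reached $(percent_of 1.26 "$ceiling")% of that"
```

Even with the conservative per-device figure, the measured write bandwidth is a
small fraction of what the devices and network should allow.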

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
[RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
[RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
[RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
[RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
[RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

When running "./io500 ./config-minimalLOCAL.ini" on a single locally mounted ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write       47.979181 kIOPS : time 1.838 seconds [INVALID]
[RESULT]    mdtest-hard-write       27.801814 kIOPS : time 2.443 seconds [INVALID]
[RESULT]                 find     1384.774433 kIOPS : time 0.074 seconds
[RESULT]     mdtest-easy-stat      343.232733 kIOPS : time 1.118 seconds
[RESULT]     mdtest-hard-stat      333.241620 kIOPS : time 1.123 seconds
[RESULT]   mdtest-easy-delete       45.723381 kIOPS : time 1.884 seconds
[RESULT]     mdtest-hard-read       73.637312 kIOPS : time 1.546 seconds
[RESULT]   mdtest-hard-delete       42.191867 kIOPS : time 1.956 seconds
[SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]

The metadata performance is definitely lower here, because each Lustre file create has to
create (at least) two objects (one on the MDT, one or more on the OST(s)) and then write and
access them again.  Lustre metadata performance would benefit from enabling PFL and
Data-on-MDT (per the commented-out commands above), since small files then only need the
MDT create/access.
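
For a sense of scale, the gap between the two runs can be computed directly from
the kIOPS values quoted above.  This is only arithmetic on the reported numbers;
the ratio helper is a made-up name.

```shell
#!/bin/sh
# Ratio of local-ZFS to Lustre rates for a few mdtest phases, using the
# kIOPS values copied from the two IO500 runs quoted in this thread.
ratio() {
    # $1 = local kIOPS, $2 = Lustre kIOPS
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "mdtest-easy-write: $(ratio 47.979181 0.931693)x"
echo "mdtest-hard-write: $(ratio 27.801814 0.427000)x"
echo "mdtest-easy-stat:  $(ratio 343.232733 0.570021)x"
```

A gap this large is more than the extra OST object create alone would explain, so
it is worth re-running after the layout changes to see how much of it remains.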

I have run an iperf3 test and was able to reach speeds of around 40Gb/s, so I don't think the network links are the issue (maybe it's something to do with LNet?).

I would appreciate it if anyone more knowledgeable than me could explain why the local three-disk ZFS pool performs better than the Lustre FS.
I'm very new to this kind of benchmarking, so it may also be that I am misinterpreting the results or not applying the test correctly.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
