[lustre-discuss] Poor(?) Lustre performance

Andreas Dilger adilger at whamcloud.com
Wed Apr 20 16:52:48 PDT 2022


Finn,
I can't really say for sure where the performance limitation in your system is coming from.


You'd have to re-run the tests against the local ldiskfs filesystem to see how its performance compares with that of Lustre.  The important part of benchmark testing is to systematically build a complete picture from the ground up, to see what the various components of the storage stack are capable of, and then determine where any bottlenecks are being hit.

That is what the "lustre-iokit" is intended to do - benchmark starting with the raw storage (sgpdd-survey), then the local disk filesystem (obdfilter-survey for local OSDs), then the network (lnet-selftest), and finally from the client (obdfilter-survey for network OSDs).
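For the network step, a minimal lnet-selftest session looks roughly like the following (the client and server NIDs here are just placeholders for your o2ib NIDs, and this assumes the lnet_selftest module is loaded on both ends):

  export LST_SESSION=$$
  lst new_session rw_test
  lst add_group clients 192.168.1.10@o2ib
  lst add_group servers 192.168.1.20@o2ib
  lst add_batch bulk_rw
  lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
  lst run bulk_rw
  lst stat clients servers    # watch the bandwidth, Ctrl-C to stop
  lst end_session

Comparing the lst numbers against your iperf3 results will show whether LNet itself is leaving bandwidth on the table.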

For example, run sgpdd-survey (or "fio") with small and large IO sizes against the storage devices, individually *AND IN PARALLEL*, to determine their performance characteristics.  Running in parallel is critical, since you may see e.g. 3GB/s reads and 2GB/s writes from a single NVMe device, but *not* see 4x that performance when running on 4x NVMe devices because of CPU and/or PCIe and/or memory bandwidth limitations.  Similarly, you may see reasonable performance from a single OSS, but network congestion (on the client, switch(es), or server) may prevent the performance from scaling as more servers are added.
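As a rough fio sketch (the device names are only examples, and writing to raw devices is destructive, so only do this before formatting them for Lustre):

  # large sequential writes against a single NVMe device
  fio --name=nvme1 --filename=/dev/nvme1n1 --direct=1 --rw=write --bs=1M \
      --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting

  # the same workload against all four devices at once, one job per device,
  # to expose any CPU/PCIe/memory-bandwidth ceiling on the node
  fio --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 \
      --runtime=60 --time_based --group_reporting \
      --name=nvme1 --filename=/dev/nvme1n1 --name=nvme2 --filename=/dev/nvme2n1 \
      --name=nvme3 --filename=/dev/nvme3n1 --name=nvme4 --filename=/dev/nvme4n1

Repeat with --bs=4k and --rw=randwrite (or randread) to get the small-IO/IOPS side of the picture.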

This is described in some detail at https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md

Cheers, Andreas

On Apr 20, 2022, at 12:03, Finn Rawles Malliagh <up883044 at myport.ac.uk<mailto:up883044 at myport.ac.uk>> wrote:

Hi Andreas,

Thank you for taking the time to reply with such a detailed response.
I have taken your advice on board and made some changes. Firstly, I have swapped from ZFS and am now using striped LVM groups (including the P4800X, instead of using it as a cache drive). I have also modified io500.sh to include the optimisations you listed. Rerunning the IO500 benchmark gives the metadata results below:

With ZFS
[RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
[RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
[RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
[RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
[RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
[RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

With LVM:
[RESULT]    mdtest-easy-write        3.057249 kIOPS : time 27.177 seconds [INVALID]
[RESULT]    mdtest-hard-write        1.576865 kIOPS : time 51.740 seconds [INVALID]
[RESULT]                 find       71.979457 kIOPS : time 2.234 seconds
[RESULT]     mdtest-easy-stat        1.841655 kIOPS : time 44.443 seconds
[RESULT]     mdtest-hard-stat        1.779211 kIOPS : time 45.967 seconds
[RESULT]   mdtest-easy-delete        1.559825 kIOPS : time 52.301 seconds
[RESULT]     mdtest-hard-read        0.631109 kIOPS : time 127.765 seconds
[RESULT]   mdtest-hard-delete        0.856858 kIOPS : time 94.372 seconds
[SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]

I believe these scores are more in line with what I should expect; however, it seems that my throughput performance is still lacking(?). In your expert opinion, is this just a case of tuning IO500/LVM parameters further, or is it something more fundamental about the configuration of this Lustre cluster?

With LVM
[RESULT]       ior-easy-write        2.127026 GiB/s : time 122.305 seconds [INVALID]
[RESULT]       ior-hard-write        1.408638 GiB/s : time 1.246 seconds [INVALID]
[RESULT]        ior-easy-read        1.549550 GiB/s : time 167.881 seconds
[RESULT]        ior-hard-read        0.174036 GiB/s : time 10.063 seconds


Kind Regards,
Finn

On Wed, 20 Apr 2022 at 09:24, Andreas Dilger <adilger at whamcloud.com<mailto:adilger at whamcloud.com>> wrote:
On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org>> wrote:

Hi all,

I have just set up a three-node Lustre configuration, and initial testing shows what I think are slow results. The current configuration is 2 OSS and 1 MDS-MGS; each OSS/MGS node has 4x Intel P3600, 1x Intel P4800, an Intel E810 100GbE NIC, 2x 6252 CPUs, and 380GB of DRAM.
I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, and rdma-core-35.0 (RoCEv2 is enabled).
All zpools are set up identically for OST1, OST2, and MDT1:

[root at stor3 ~]# zpool status
  pool: osstank
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        osstank     ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
        cache
          nvme0n1   ONLINE       0     0     0

It's been a while since I've done anything with ZFS, but I see a few potential issues here:
- firstly, it doesn't make sense IMHO to have an NVMe cache device when the main storage
  pool is also NVMe.  You could better use that capacity/bandwidth for storing more data
  instead of duplicating it into the cache device (see the sketch below).
- in general, ZFS is not very good at IOPS workloads because of the high overhead per block.
  Also, Lustre can't use the ZIL, so there is no opportunity to accelerate heavy IOPS workloads.
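A minimal sketch of the first point, using the pool/device names from your "zpool status" output above (this keeps the pool as a non-redundant stripe, which it already is; try it on a non-production pool first):

  zpool remove osstank nvme0n1    # detach the L2ARC cache device
  zpool add osstank nvme0n1       # re-add it as a regular striped data vdev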

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.173435 GiB/s : time 31.703 seconds [INVALID]
[RESULT]       ior-hard-write        0.821624 GiB/s : time 1.070 seconds [INVALID]
[RESULT]        ior-easy-read        5.177930 GiB/s : time 7.187 seconds
[RESULT]        ior-hard-read        5.331791 GiB/s : time 0.167 seconds

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]       ior-easy-write        1.304500 GiB/s : time 33.302 seconds [INVALID]
[RESULT]       ior-hard-write        0.485283 GiB/s : time 1.806 seconds [INVALID]
[RESULT]        ior-easy-read        3.078668 GiB/s : time 14.111 seconds
[RESULT]        ior-hard-read        3.183521 GiB/s : time 0.275 seconds

There are definitely some file layout tunables that can improve IO500 performance for these workloads.
See the default io500.sh file, where they are commented out by default:

  # Example commands to create output directories for Lustre.  Creating
  # top-level directories is allowed, but not the whole directory tree.
  #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
  #  lfs setdirstripe -D -c -1 $workdir
  #fi
  #lfs setstripe -c 1 $workdir
  #mkdir $workdir/ior-easy $workdir/ior-hard
  #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
  #local osts=$(lfs df $workdir | grep -c OST)
  # Try overstriping for ior-hard to improve scaling, or use wide striping
  #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
  #  lfs setstripe -c -1 $workdir/ior-hard
  # Try to use DoM if available, otherwise use default for small files
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd


As you can see above, the IO performance of Lustre isn't really much different from the local storage
performance of ZFS.  You are always going to lose some percentage over the network and because
of the added distributed locking.  That said, for the hardware that you have, it should be getting about
2-3GB/s per NVMe device, and up to 10GB/s over the network, so the limitation here is really ZFS.
It would be useful to test with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.
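Something along these lines, as an untested sketch (device names, stripe size, filesystem name, and the MGS NID are all placeholders to adjust for your setup):

  pvcreate /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  vgcreate ostvg /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  lvcreate --name ost0 --stripes 3 --stripesize 1M --extents 100%FREE ostvg
  mkfs.lustre --ost --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=<mgs_nid>@o2ib /dev/ostvg/ost0
  mount -t lustre /dev/ostvg/ost0 /mnt/ost0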

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
[RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
[RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
[RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
[RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
[RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write       47.979181 kIOPS : time 1.838 seconds [INVALID]
[RESULT]    mdtest-hard-write       27.801814 kIOPS : time 2.443 seconds [INVALID]
[RESULT]                 find     1384.774433 kIOPS : time 0.074 seconds
[RESULT]     mdtest-easy-stat      343.232733 kIOPS : time 1.118 seconds
[RESULT]     mdtest-hard-stat      333.241620 kIOPS : time 1.123 seconds
[RESULT]   mdtest-easy-delete       45.723381 kIOPS : time 1.884 seconds
[RESULT]     mdtest-hard-read       73.637312 kIOPS : time 1.546 seconds
[RESULT]   mdtest-hard-delete       42.191867 kIOPS : time 1.956 seconds
[SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL 13.456159 [INVALID]

The metadata performance is definitely lower here, because each Lustre file has to create (at least)
two objects (one on the MDT, one or more on the OST(s)) and then write and access them again.
Lustre metadata performance would benefit from enabling PFL and Data-on-MDT (per the example
commands above), since small files then only need the MDT create/access.
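For reference, a composite (PFL) layout combining DoM with OST striping might look like this (the component sizes and stripe counts are only illustrative, not tuned values):

  # first 64KiB of each file on the MDT, the next 1GiB on a single OST stripe,
  # and anything beyond that striped across all OSTs
  lfs setstripe -E 64k -L mdt -E 1G -c 1 -E -1 -c -1 $workdir/mdtest-easy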

I have run an iperf3 test and was able to reach around 40Gb/s, so I don't think the network links are the issue (maybe it's something to do with LNet?).

Could anyone more knowledgeable than me please explain why the local three-disk ZFS pool is more performant than the Lustre filesystem?
I'm very new to this kind of benchmarking, so it may also be that I am misinterpreting the results or not applying the tests correctly.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud