[lustre-discuss] Poor(?) Lustre performance

Finn Rawles Malliagh up883044 at myport.ac.uk
Wed Apr 20 17:13:29 PDT 2022


Andreas,

Thank you again for your detailed reply and your time. I will take a further
look at the Lustre IO kit and hopefully get to the bottom of things.

Cheers,
Finn

On Thu, 21 Apr 2022 at 00:52, Andreas Dilger <adilger at whamcloud.com> wrote:

> Finn,
> I can't really say for sure where the performance limitation in your
> system is coming from.
>
>
> You'd have to re-run the tests against the local ldiskfs filesystem to see
> how its performance compares with that of Lustre.  The important part of
> benchmark testing is to systematically build a complete picture from the
> ground up to see what the capabilities of the various components of the
> storage stack are, and then determine where any bottlenecks are being hit.
>
> That is what the "lustre-iokit" is intended to do - benchmark starting on
> the raw storage (sgpdd-survey), on the local disk filesystem
> (obdfilter-survey for local OSDs), then the network (lnet-selftest), and
> finally on the client (obdfilter-survey for network OSDs).
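>
> As a rough illustration (the NIDs below are placeholders for your client and
> server nodes), a minimal lnet_selftest bulk-write run looks something like:
>
>   # load the lnet_selftest module on every node involved first
>   export LST_SESSION=$$
>   lst new_session rw_test
>   lst add_group servers 10.0.0.2@o2ib
>   lst add_group clients 10.0.0.3@o2ib
>   lst add_batch bulk
>   lst add_test --batch bulk --from clients --to servers brw write size=1M
>   lst run bulk
>   lst stat clients servers     # let it run for a while, then Ctrl-C
>   lst end_session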
>
> For example, run sgpdd-survey (or "fio") with small and large IO sizes
> against the storage devices, individually *AND IN PARALLEL*, to determine
> their performance characteristics.  Running in parallel is critical, since
> you may see e.g. 3GB/s reads and 2GB/s writes from a single NVMe device,
> but *not* see 4x that performance when running on 4x NVMe devices, because
> of CPU and/or PCI and/or memory bandwidth limitations.  Similarly, you may
> see reasonable performance from a single OSS, but network congestion (on
> the client, switch(es), or server) may prevent the performance from scaling
> as more servers are added.
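>
> A rough sketch of that kind of parallel fio run (device names are
> placeholders, and writing to raw devices destroys their contents, so only do
> this before formatting):
>
>   # global options come first; each --name/--filename pair is a separate job,
>   # and all jobs run concurrently, exposing shared CPU/PCIe/memory bottlenecks
>   fio --direct=1 --ioengine=libaio --iodepth=32 --bs=1M --rw=write \
>       --runtime=60 --time_based --group_reporting \
>       --name=nvme1 --filename=/dev/nvme1n1 \
>       --name=nvme2 --filename=/dev/nvme2n1 \
>       --name=nvme3 --filename=/dev/nvme3n1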
>
> This is described in some detail at
> https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md
>
> Cheers, Andreas
>
> On Apr 20, 2022, at 12:03, Finn Rawles Malliagh <up883044 at myport.ac.uk>
> wrote:
>
> Hi Andreas,
>
> Thank you for taking the time to reply with such a detailed response.
> I have taken your advice on board and made some changes. Firstly, I have
> switched from ZFS and am now using striped LVM volumes (including the P4800X
> in the stripe instead of using it as a cache drive). I have also modified
> io500.sh to include the optimisations you listed. Rerunning the IO500
> benchmark provides the metadata results below:
>
> With ZFS:
> [RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds
> [INVALID]
> [RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds
> [INVALID]
> [RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
> [RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
> [RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
> [RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
> [RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
> [RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258
> [INVALID]
>
> With LVM:
> [RESULT]    mdtest-easy-write        3.057249 kIOPS : time 27.177 seconds
> [INVALID]
> [RESULT]    mdtest-hard-write        1.576865 kIOPS : time 51.740 seconds
> [INVALID]
> [RESULT]                 find       71.979457 kIOPS : time 2.234 seconds
> [RESULT]     mdtest-easy-stat        1.841655 kIOPS : time 44.443 seconds
> [RESULT]     mdtest-hard-stat        1.779211 kIOPS : time 45.967 seconds
> [RESULT]   mdtest-easy-delete        1.559825 kIOPS : time 52.301 seconds
> [RESULT]     mdtest-hard-read        0.631109 kIOPS : time 127.765 seconds
> [RESULT]   mdtest-hard-delete        0.856858 kIOPS : time 94.372 seconds
> [SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524
> [INVALID]
>
> I believe these scores are more in line with what I should expect; however,
> the throughput performance still seems to be lacking(?). In your expert
> opinion, is this just a case of tuning IO500/LVM parameters further, or is
> there something more fundamental wrong with the configuration of this
> Lustre cluster?
>
> With LVM:
> [RESULT]       ior-easy-write        2.127026 GiB/s : time 122.305 seconds
> [INVALID]
> [RESULT]       ior-hard-write        1.408638 GiB/s : time 1.246 seconds
> [INVALID]
> [RESULT]        ior-easy-read        1.549550 GiB/s : time 167.881 seconds
> [RESULT]        ior-hard-read        0.174036 GiB/s : time 10.063 seconds
>
>
> Kind Regards,
> Finn
>
> On Wed, 20 Apr 2022 at 09:24, Andreas Dilger <adilger at whamcloud.com>
> wrote:
>
>> On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <
>> lustre-discuss at lists.lustre.org> wrote:
>>
>>
>> Hi all,
>>
>> I have just set up a three-node Lustre configuration, and initial testing
>> shows what I think are slow results. The current configuration is 2 OSS, 1
>> MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe
>> eth, 2x 6252, 380GB dram
>> I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, and rdma-core-35.0 (RoCEv2
>> is enabled).
>> All zpools are set up identically for OST1, OST2, and MDT1:
>>
>> [root at stor3 ~]# zpool status
>>   pool: osstank
>>  state: ONLINE
>>   scan: none requested
>> config:
>>         NAME        STATE     READ WRITE CKSUM
>>         osstank     ONLINE       0     0     0
>>           nvme1n1   ONLINE       0     0     0
>>           nvme2n1   ONLINE       0     0     0
>>           nvme3n1   ONLINE       0     0     0
>>         cache
>>           nvme0n1   ONLINE       0     0     0
>>
>>
>> It's been a while since I've done anything with ZFS, but I see a few
>> potential issues here:
>> - firstly, it doesn't make sense IMHO to have an NVMe cache device when the
>>   main storage pool is also NVMe.  You could make better use of that
>>   capacity/bandwidth for storing more data instead of duplicating it into
>>   the cache device.  Also, Lustre cannot use the ZIL.
>> - in general, ZFS is not very good at IOPS workloads because of the high
>>   per-block overhead.  Since Lustre can't use the ZIL, there is no
>>   opportunity to accelerate heavy IOPS workloads.
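>>
>> If you do drop the cache device, a rough sketch of reusing it as a data vdev
>> (pool and device names taken from your zpool status above):
>>
>>   zpool remove osstank nvme0n1   # detach the L2ARC cache device
>>   zpool add osstank nvme0n1      # re-add it as a regular data vdev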
>>
>> When running "./io500 ./config-minimalLUST.ini" on my lustre client, I
>> get these performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT]       ior-easy-write        1.173435 GiB/s : time 31.703 seconds
>> [INVALID]
>> [RESULT]       ior-hard-write        0.821624 GiB/s : time 1.070 seconds
>> [INVALID]
>>
>> [RESULT]        ior-easy-read        5.177930 GiB/s : time 7.187 seconds
>>
>> [RESULT]        ior-hard-read        5.331791 GiB/s : time 0.167 seconds
>>
>>
>> When running "./io500 ./config-minimalLOCAL.ini" on a single locally
>> mounted ZFS pool, I get the following performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>> [RESULT]       ior-easy-write        1.304500 GiB/s : time 33.302 seconds
>> [INVALID]
>>
>> [RESULT]       ior-hard-write        0.485283 GiB/s : time 1.806 seconds
>> [INVALID]
>>
>> [RESULT]        ior-easy-read        3.078668 GiB/s : time 14.111 seconds
>>
>> [RESULT]        ior-hard-read        3.183521 GiB/s : time 0.275 seconds
>>
>>
>> There are definitely some file layout tunables that can improve IO500
>> performance for these workloads.
>> See the default io500.sh file, where they are commented out by default:
>>
>>   # Example commands to create output directories for Lustre.  Creating
>>   # top-level directories is allowed, but not the whole directory tree.
>>   #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
>>   #  lfs setdirstripe -D -c -1 $workdir
>>   #fi
>>   #lfs setstripe -c 1 $workdir
>>   #mkdir $workdir/ior-easy $workdir/ior-hard
>>   #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
>>   #local osts=$(lfs df $workdir | grep -c OST)
>>   # Try overstriping for ior-hard to improve scaling, or use wide striping
>>   #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
>>   #  lfs setstripe -c -1 $workdir/ior-hard
>>   # Try to use DoM if available, otherwise use default for small files
>>   #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
>>   #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
>>   #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
>>
>>
>> As you can see above, the IO performance of Lustre isn't really much
>> different from the local storage performance of ZFS.  You are always going
>> to lose some percentage over the network and because of the added
>> distributed locking.  That said, for the hardware that you have, it should
>> be getting about 2-3GB/s per NVMe device, and up to 10GB/s over the
>> network, so the limitation here is really ZFS.  It would be useful to test
>> with ldiskfs on the same hardware, maybe with LVM aggregating the NVMes.
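>>
>> A rough sketch of that (volume group name, stripe size, fsname, index, and
>> MGS NID are all placeholders to adapt):
>>
>>   pvcreate /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>   vgcreate ostvg /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>   # stripe one logical volume across all PVs with a 1MiB stripe size
>>   lvcreate -i 3 -I 1m -l 100%FREE -n ost0 ostvg
>>   mkfs.lustre --ost --backfstype=ldiskfs --fsname=testfs --index=0 \
>>       --mgsnode=10.0.0.1@o2ib /dev/ostvg/ost0
>>   mount -t lustre /dev/ostvg/ost0 /mnt/ost0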
>>
>> When running "./io500 ./config-minimalLUST.ini" on my lustre client, I
>> get these performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>>
>> [RESULT]    mdtest-easy-write        0.931693 kIOPS : time 31.028 seconds
>> [INVALID]
>>
>> [RESULT]    mdtest-hard-write        0.427000 kIOPS : time 31.070 seconds
>> [INVALID]
>> [RESULT]                 find       25.311534 kIOPS : time 1.631 seconds
>> [RESULT]     mdtest-easy-stat        0.570021 kIOPS : time 50.067 seconds
>> [RESULT]     mdtest-hard-stat        1.834985 kIOPS : time 7.998 seconds
>> [RESULT]   mdtest-easy-delete        1.715750 kIOPS : time 17.308 seconds
>> [RESULT]     mdtest-hard-read        1.006240 kIOPS : time 13.759 seconds
>> [RESULT]   mdtest-hard-delete        1.624117 kIOPS : time 8.910 seconds
>> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258
>> [INVALID]
>>
>> When running "./io500 ./config-minimalLOCAL.ini" on a single locally
>> mounted ZFS pool, I get the following performance numbers:
>> IO500 version io500-isc22_v1 (standard)
>>
>> [RESULT]    mdtest-easy-write       47.979181 kIOPS : time 1.838 seconds
>> [INVALID]
>> [RESULT]    mdtest-hard-write       27.801814 kIOPS : time 2.443 seconds
>> [INVALID]
>> [RESULT]                 find     1384.774433 kIOPS : time 0.074 seconds
>> [RESULT]     mdtest-easy-stat      343.232733 kIOPS : time 1.118 seconds
>> [RESULT]     mdtest-hard-stat      333.241620 kIOPS : time 1.123 seconds
>> [RESULT]   mdtest-easy-delete       45.723381 kIOPS : time 1.884 seconds
>> [RESULT]     mdtest-hard-read       73.637312 kIOPS : time 1.546 seconds
>> [RESULT]   mdtest-hard-delete       42.191867 kIOPS : time 1.956 seconds
>> [SCORE ] Bandwidth 1.578256 GiB/s : IOPS 114.726763 kiops : TOTAL
>> 13.456159 [INVALID]
>>
>>
>> The metadata performance is definitely lower here, because each Lustre file
>> has to create (at least) two objects (one on the MDT, one or more on the
>> OST(s)) and then write and access them again.  Lustre metadata performance
>> would definitely benefit from enabling PFL and Data-on-MDT (per the default
>> commands above), since small files then only need the MDT create/access.
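>>
>> For example, a composite layout roughly like this (component boundaries are
>> only illustrative) keeps small files entirely on the MDT and lets larger
>> files grow onto the OSTs:
>>
>>   # 0-64KiB on the MDT (DoM), 64KiB-1GiB on one OST, beyond that striped wide
>>   lfs setstripe -E 64k -L mdt -E 1G -c 1 -E -1 -c -1 $workdir/mdtest-easy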
>>
>> I have run an iperf3 test and was able to reach speeds of around 40 Gb/s,
>> so I don't think the network links are the issue (maybe it's something to
>> do with LNet?).
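>>
>> (A quick way to sanity-check which network LNet is actually using, e.g.
>> o2ib for RoCE rather than tcp, is something like the following on each
>> node:)
>>
>>   lnetctl net show      # configured networks and NIDs
>>   lnetctl stats show    # cumulative LNet send/receive counters
>>   lctl list_nids        # NIDs this node identifies itself with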
>>
>> Could anyone more knowledgeable than me please explain why the local
>> three-disk ZFS pool is more performant than the Lustre filesystem?
>> I'm very new to this kind of benchmarking, so it may also be that I am
>> misinterpreting the results or not running the tests correctly.
>>
>>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Whamcloud
>>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>