[lustre-discuss] Interesting read bandwidth observations
Vinayak.Kamath
Vinayak.Kamath at target.com
Mon Aug 2 21:03:03 PDT 2021
The read bandwidth on my test setup is 20%-55% of the bandwidth I was able to get from an ldiskfs-based Lustre 2.14 setup on the same VMs running CentOS 8. I’d like feedback on the observations I made during my analysis.
SETUP:
My test setup has four VMs on a single host where:
1. VM1: MGS + MDS
2. VM2: OSS1 (w/ two 40GB OSTs)
3. VM3: OSS2 (w/ one 40GB OST)
4. VM4: POSIX client
I’m running Lustre 2.12.6 on CentOS 7, kernel 3.10.0-1160.2.1.el7.x86_64. The OSTs use ZFS. The complete install was done using RPMs – no custom builds.
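The ZFS OSTs were formatted and mounted the standard way with mkfs.lustre, roughly along these lines (the filesystem name, pool/dataset names, device, and MGS NID below are placeholders, not my actual values):

    # illustrative only: format and mount one ZFS-backed OST
    mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
        --mgsnode=10.0.0.1@tcp ost0pool/ost0 /dev/sdb
    mount -t lustre ost0pool/ost0 /mnt/ost0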
BENCHMARK:
While this observation holds true for several of the tests I’ve performed, I will only describe one of them here.
I’m running fio on the client to read 200 ImageNet files with a 4k block size, ioengine=psync, and iodepth=1. The total size of the data set is 23.4 GB. All machines have 32 GB of memory.
The files are already cached on the OSS nodes, so the data is really being transferred from OSS memory to client memory and there is no disk activity on the OSTs.
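For reference, the fio job boils down to something like the following (the path is a placeholder, and pointing fio at the directory of existing files via --opendir is my shorthand for how the 200 files are fed to it, not the exact job definition):

    # illustrative fio command matching the parameters above; path is a placeholder
    fio --name=imagenet-read --ioengine=psync --rw=read --bs=4k --iodepth=1 \
        --opendir=/mnt/lustre/imagenet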
OBSERVATIONS:
The following four graphs were plotted from the data I collected on the four nodes using sar. After the initial blip where the client talks to the MGS, the client-OSS throughput only ramps up slowly, reaching a peak rate of 25 MBps.
The client CPU usage peaks at about 55% (not shown) and the run queue size is never over 20 (also not shown), leading me to believe that we’re not limited by the client CPU at any point during the transfer.
[Four graphs: per-node sar data from the MGS/MDS, OSS1, OSS2, and client nodes (image001.png through image004.png, scrubbed from the archive)]
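The per-node data behind these graphs comes from simple interval sampling with sar, along these lines (the intervals and flags shown are representative, not the exact invocations):

    sar -n DEV 1    # per-interface RX/TX throughput
    sar -u 1        # CPU utilization
    sar -q 1        # run-queue length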
QUESTION:
Why aren’t the two OSSs able to start sending data faster? What is causing the TX rate to climb gradually? Neither OSS node is CPU- or I/O-limited. arcstat.py on the OSS shows memory (ARC) reads in line with the incoming traffic (see image below).
For bigger workloads, it just takes longer to hit the same peak bandwidth. I’m never able to achieve network saturation. Where is the performance bottleneck?
[Graph: arcstat.py output on the OSS (image005.png, scrubbed from the archive)]
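The arcstat.py figures above are plain interval samples on the OSS, e.g.:

    arcstat.py 1    # per-second ZFS ARC read/hit/miss statistics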
Thanks,
Vinayak