[lustre-discuss] Can't reach full throughput bandwidth on Mellanox
Andreas Dilger
adilger at thelustrecollective.com
Tue Mar 31 02:23:41 PDT 2026
On Mar 29, 2026, at 11:33, Stepan Beskrovnyy <bsm099 at gmail.com> wrote:
> Hello everyone!
>
> I have run a number of self-tests of EC branch performance so far.
Thank you for testing this code. It is still under heavy development,
so the performance is not the central focus yet.
> About my network config, I got:
>
> 7 servers with Mellanox ConnectX-7 with 2x100G ports.
>
> Lustre topology:
> 1 MGS/MDT server
> 6 servers with 2 OSS each
>
> In total, 12 OSSs and 1 MDT. All OSSs sit on an SPDK RAID-0 of 8 NVMe drives with high throughput.
>
> Using EC 10+2 on cluster and 1M stripe pattern.
>
> ib_write/read tests work well between nodes, showing 100G per interface. All packets are sent to prio3.
So this would be about 6 x 2 x 100 Gbit/s = 1200 Gbit/s, or roughly 150 GB/s
of raw network bandwidth (somewhat less in practice after protocol overhead),
but it is unclear what the storage throughput is.
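As a back-of-envelope check (a sketch only; the 100 Gbit/s figure is the raw link rate, and real throughput will be lower due to protocol overhead):

```python
# Back-of-envelope aggregate network bandwidth for the OSS tier.
# Assumes 6 OSS servers, each with 2 x 100 Gbit/s ports (per the post).
oss_servers = 6
ports_per_server = 2
gbit_per_port = 100

total_gbit = oss_servers * ports_per_server * gbit_per_port  # 1200 Gbit/s
total_gbyte = total_gbit / 8                                 # 150 GB/s raw

print(f"{total_gbit} Gbit/s ~= {total_gbyte:.0f} GB/s aggregate (raw)")
```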
> But in IOR benchmarks I get only 65 Gib/s for both read and write (--posix.odirect, no caching).
Note that O_DIRECT will perform better with larger IO sizes (8 MiB or larger).
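For example, something along these lines (standard IOR options; the transfer and block sizes here are just illustrative starting points to tune for your setup):

```shell
# -t: transfer size per IO call, -b: block size per task, -F: file-per-process
ior -a POSIX --posix.odirect -t 16m -b 4g -F
```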
> Network utilization during the tests is only about 20%.
>
> How can I tune my configuration further? Any ideas?
Have you tested the ISA-L assembly-optimized patches that were very recently
added at the top of the EC patch series? Those improve EC calculation speed
from about 550 MB/s up to 20 GB/s on x86_64, so they will significantly
improve the EC calculation part. Whether that is your bottleneck
remains to be seen.
> And another question, is there any way in Erasure Coding branch to choose parity-block OSS’s manually?
Yes, later in the patch series there is a "failure_domain" configuration
parameter added for OSTs that allows grouping OSTs into failure domains
(as you see fit: per-OSS, per-failover-pair, per-rack, etc.). The
object allocator will not allocate multiple objects from the same failure
domain for the data or parity stripes within a raidset (e.g. an 8+2
data+parity grouping of stripes).
If the file is widely striped then it will be split into multiple raidsets
with the requested geometry so that each raidset is independently recoverable,
but allows using the full bandwidth of the storage.
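To illustrate the idea (a simplified sketch, not the actual Lustre allocator; the OST names and domain mapping here are hypothetical):

```python
# Sketch: split a widely striped file into independent raidsets of
# geometry 8+2, never placing two stripes of one raidset in the same
# failure domain. Hypothetical layout, not real Lustre code.
from itertools import cycle

DATA, PARITY = 8, 2
WIDTH = DATA + PARITY  # stripes per raidset

# Hypothetical mapping: 12 OSTs spread across 12 failure domains.
OSTS = [(f"OST{i:04x}", f"domain{i}") for i in range(12)]

def allocate(stripe_count):
    """Group stripes into raidsets; each raidset uses distinct domains."""
    raidsets = []
    pool = cycle(OSTS)
    for _ in range(0, stripe_count, WIDTH):
        used_domains, members = set(), []
        while len(members) < WIDTH:
            ost, dom = next(pool)
            if dom not in used_domains:
                used_domains.add(dom)
                members.append(ost)
        raidsets.append(members)
    return raidsets

# A 20-stripe file becomes two independently recoverable 8+2 raidsets,
# each spanning 10 distinct failure domains.
for rs in allocate(20):
    print(rs)
```

Each raidset can then lose up to PARITY domains and still recover, while the file as a whole still spreads IO across all the OSTs.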
Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
adilger at thelustrecollective.com