[lustre-discuss] Can't reach full throughput bandwidth on Mellanox

Andreas Dilger adilger at thelustrecollective.com
Thu Apr 2 03:34:00 PDT 2026


Stepan,
are you using buffered or direct IO on the clients?  Buffered IO has a lot
of overhead in the kernel, so DIO with large records (32MB+) can achieve
significantly higher performance in most cases.
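For example, a minimal fio comparison of the two paths (a sketch only: the /mnt/lustre mount point, sizes, and job count are placeholders to adjust for your setup):

```
# Buffered write baseline (goes through the page cache)
fio --name=buffered --directory=/mnt/lustre --rw=write \
    --bs=32M --size=8G --numjobs=8 --group_reporting

# Direct IO with large records (bypasses the page cache)
fio --name=dio --directory=/mnt/lustre --rw=write \
    --bs=32M --size=8G --numjobs=8 --direct=1 --group_reporting
```

With --direct=1 and 32M records, each write goes straight to the OSCs without the kernel copying pages through the cache.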

Cheers, Andreas

> On Mar 31, 2026, at 04:42, Stepan Beskrovnyy <bsm099 at gmail.com> wrote:
> 
> Thanks, Andreas!
> 
> 
> Got a little update!
> 
> 
> I have noticed that I hit a limit of about 17 Gb/s per client in different benchmarks. 
> With IOR I get on write: 
> 
> 1 client - 15-17 Gb/s (mpirun -np 64)
> 2 clients - 30-34 Gb/s (mpirun -np 128)
> 3 clients - 45-50 Gb/s (mpirun -np 192)
> 4 clients - 60-67 Gb/s (mpirun -np 256)
> 
> Each client has 64 threads and runs all of them. 
> 
> 
> If I use an FIO benchmark on a directory, I also hit the 17 Gb/s limit, starting from --numjobs=32 with iodepth=1. 
> 
> Is there a way to get past this limit? Does it depend on kernel settings? What parameters determine the throughput of a single client?
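Single-client throughput in Lustre is usually bounded by per-OSC RPC concurrency and dirty-page limits; as a sketch (these are the standard Lustre client tunables, but the values below are illustrative, not recommendations):

```
# Inspect current per-OSC limits on the client
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb

# Raise RPC concurrency and the dirty-page ceiling (illustrative values)
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=512

# Larger RPCs also help; check the negotiated max pages per RPC
lctl get_param osc.*.max_pages_per_rpc
```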
> 
> 
> Thanks, 
> Stepan
> 
> On Tue, Mar 31, 2026 at 12:23, Andreas Dilger <adilger at thelustrecollective.com> wrote:
> On Mar 29, 2026, at 11:33, Stepan Beskrovnyy <bsm099 at gmail.com> wrote:
> > Hello everyone! 
> > 
> > I have run a bunch of tests of the EC branch performance so far. 
> 
> Thank you for testing this code.  It is still under heavy development,
> so the performance is not the central focus yet.
> 
> > About my network config, I got:
> > 
> > 7 servers with Mellanox ConnectX-7 with 2x100G ports. 
> > 
> > Lustre topology: 
> > 1 MGS/MDT server 
> > 6 servers, each running 2 OSSs 
> > 
> > In total: 12 OSSs and 1 MDT. Each OSS sits on an SPDK RAID-0 of 8 NVMe drives with high throughput. 
> > 
> > Using EC 10+2 on the cluster with a 1M stripe pattern. 
> > 
> > ib_write/read tests work well between nodes, showing 100G per interface. All packets are sent to prio3. 
> 
> So this would be about 6 x 2 x 100Gbit/s = 1200 Gbit/s ~= 150 GB/s raw network
> speed (roughly 120 GB/s usable), but it is unclear what the storage throughput is.
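A quick sanity check on that arithmetic (the ~80% efficiency factor here is an assumption to account for protocol overhead on the raw line rate):

```python
# Aggregate raw network bandwidth across the OSS nodes
servers = 6            # OSS nodes
ports_per_server = 2   # ConnectX-7 100G ports per node
gbit_per_port = 100

raw_gbit = servers * ports_per_server * gbit_per_port  # 1200 Gbit/s
raw_gb = raw_gbit / 8                                  # 150 GB/s raw
usable_gb = raw_gb * 0.8                               # ~120 GB/s at ~80% efficiency

print(f"raw: {raw_gbit} Gbit/s = {raw_gb:.0f} GB/s, usable ~{usable_gb:.0f} GB/s")
```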
> 
> > But with the IOR benchmark I get only 65 Gib/s on both read and write (--posix.odirect, no caching). 
> 
> Note that O_DIRECT will perform better with larger IO sizes (8 MiB or larger).
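For instance, an illustrative IOR invocation along those lines (-t is the transfer size per IO, -b the per-task block size; the path and process count are placeholders):

```
mpirun -np 64 ior -a POSIX --posix.odirect -t 8m -b 4g -F -e \
    -o /mnt/lustre/testfile
```

-F gives file-per-process and -e forces fsync at the end of each write phase, so the reported bandwidth reflects data actually on the OSTs.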
> 
> > Network utilization during the tests is only about 20%. 
> > 
> > How can I tune my configuration further? Any ideas?
> 
> Have you tested the ISA-L assembly-optimized patches that were very recently
> added at the top of the EC patch series?  Those improve EC calculation speed
> from about 550MB/s up to 20GB/s for x86_64, so will improve performance very
> significantly for the EC calculation part.  Whether that is your bottleneck
> remains to be seen.
> 
> > And another question: is there any way in the Erasure Coding branch to choose the parity-block OSSs manually? 
> 
> Yes, later in the patch series there is a "failure_domain" configuration
> parameter added for OSTs that allows grouping OSTs into failure domains
> (as you see fit: per-OSS, per-failover pair, per-rack, etc.).  The
> object allocator will not allocate multiple objects from the same failure
> domain for the data or parity stripes of a raidset (e.g. an 8+2 data+parity
> grouping of stripes).
> 
> If the file is widely striped, it will be split into multiple raidsets
> with the requested geometry so that each raidset is independently recoverable
> while still using the full bandwidth of the storage.
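As a sketch of what that might look like: only the "failure_domain" parameter name comes from the patch series, so the command form, parameter path, and domain labels below are hypothetical placeholders, not a documented interface.

```
# Hypothetical: group each OSS's OSTs into one failure domain,
# so data and parity stripes of a raidset never share an OSS
lctl set_param obdfilter.testfs-OST0000.failure_domain=oss1   # hypothetical syntax
lctl set_param obdfilter.testfs-OST0001.failure_domain=oss1   # hypothetical syntax
lctl set_param obdfilter.testfs-OST0002.failure_domain=oss2   # hypothetical syntax
```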
> 
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> adilger at thelustrecollective.com
> 

---
Andreas Dilger
Principal Lustre Architect
adilger at thelustrecollective.com