[lustre-discuss] omnipath and lnet_selftest performance
Michael DiDomenico
mdidomenico4 at gmail.com
Mon Jul 8 07:43:03 PDT 2024
These are the settings from the manual, which I tried. I'll check the
conns_per_peer setting, though; I'm not sure what mine was set to (a
sketch of how I'm checking it follows the dump below).
lnd tunables:
    peercredits_hiw: 64
    map_on_demand: 32
    concurrent_sends: 256
    fmr_pool_size: 2048
    fmr_flush_trigger: 512
    fmr_cache: 1
tunables:
    peer_timeout: 180
    peer_credits: 128
    peer_buffer_credits: 0
    credits: 1024
CPT: "[0,0,0,0]"
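
To check the current value, a minimal sketch of the commands I'm using
(assuming the ko2iblnd module is loaded; the sysfs path is just the
standard module-parameter location):

    # current conns_per_peer (any other ko2iblnd parameter works the same way)
    cat /sys/module/ko2iblnd/parameters/conns_per_peer

    # dump the live LNet configuration and tunables
    lnetctl net show -v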
On Sat, Jul 6, 2024 at 6:07 AM Andreas Dilger <adilger at whamcloud.com> wrote:
>
> On Jul 5, 2024, at 11:37, Michael DiDomenico via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>
> I could use a little help with Lustre clients over Omni-Path. When I
> run ib_write_bw tests between two compute nodes I get 10+ GB/sec. The
> compute nodes are RHEL 9.4 with the RHEL hardware drivers.
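>
> (Roughly the invocation used, with the OPA HFI device name assumed to
> be hfi1_0 for illustration; adjust for your hardware:
>
>     # on node3, server side
>     ib_write_bw -d hfi1_0 -s 1048576
>     # on node1, client side
>     ib_write_bw -d hfi1_0 -s 1048576 node3
> )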
>
> However, when I run lnet_selftest between the same two compute nodes
> (1M I/O size, concurrency 16), I see:
>
> node1 -> node3
> read  1M I/O: ~7.1 GB/sec
> write 1M I/O: ~4.7 GB/sec
>
> node3 -> node1
> read  1M I/O: ~6.6 GB/sec
> write 1M I/O: ~4.9 GB/sec
>
> Varying the I/O size and concurrency changes the numbers, but not
> dramatically. I've gone through the tuning guide for Omni-Path and my
> LND tunables all match, but I can't seem to drive the bandwidth any
> higher between the nodes.
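>
> (For reference, the runs were driven with an lst script along these
> lines; a minimal sketch, with the session name and NIDs assumed for
> illustration:
>
>     export LST_SESSION=$$
>     lst new_session rw_test
>     lst add_group clients 10.0.0.1@o2ib
>     lst add_group servers 10.0.0.3@o2ib
>     lst add_batch bulk
>     lst add_test --batch bulk --concurrency 16 \
>         --from clients --to servers brw write size=1M
>     lst run bulk
>     lst stat clients servers
>     lst end_session
> )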
>
>
> Please provide the actual tuning parameters in use.
>
> Even when we were part of Intel, the OPA tuning parameters suggested by
> the OPA team were not necessarily the best in all cases. There was some
> kind of memory registration they kept suggesting, but it was always worse
> in practice than in theory.
>
> The biggest win was from setting conns_per_peer=4 or so, because OPA
> needs more CPU resources than IB to achieve good performance.
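>
> A minimal sketch of how that is typically set (module option shown;
> the value 4 is a starting point, not a measured optimum):
>
>     # /etc/modprobe.d/ko2iblnd.conf
>     options ko2iblnd conns_per_peer=4
>
> followed by reloading the LNet modules (or it takes effect at next boot).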
>
> That said, it has been several years since I've had to deal with it, so I can't
> say whether your current performance is good or bad.
>
> Can anyone suggest where I might be losing some performance, or is
> this the end? I feel like there should be more performance here, but
> since we recently retooled from RHEL 7 to RHEL 9, I'm unsure if there's
> a tunable left untuned. (Unfortunately I don't have / can't seem to find
> previous numbers to compare against.)
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud