[lustre-discuss] LNet supports IB HDR(200Gb) network?

Jongwoo Han jongwoohan at gmail.com
Thu Oct 28 03:49:53 PDT 2021


One possible cause of the bandwidth limitation is the PCIe slot itself:
HDR200 requires at least a PCIe 4.0 x16 link to reach maximum bandwidth,
while many x86 servers give only x8 lanes to I/O adapters.
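
You can confirm the negotiated link from the host with lspci; a quick
sanity check (the PCI address below is a placeholder for your ConnectX
adapter):

# locate the HCA, then inspect the negotiated PCIe link
lspci | grep -i mellanox
sudo lspci -s <pci-address> -vv | grep -E 'LnkCap|LnkSta'

LnkSta should report "Speed 16GT/s, Width x16" for a full PCIe 4.0 x16
link; an x8 (or Gen3 x16) link tops out around 16GB/s raw, which lines
up with the ~12-13GB/s reported below.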

Generally a 100Gb bandwidth limitation will not be a problem, because most
OST back-end configurations available today cannot sustain that much
throughput anyway. But when 200Gb of throughput must be provided to an OSS,
an alternative is dynamic load balancing with an LNet Multi-Rail
configuration (e.g. using 2 x 100Gb links per OSS), as sketched below.
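
A minimal Multi-Rail sketch, assuming two HCA ports named ib0 and ib1 on
each OSS (interface names are illustrative):

# static configuration via the lnet module options
options lnet networks="o2ib0(ib0,ib1)"

# or at runtime with lnetctl
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl net show

With both interfaces on the same o2ib network and peer discovery left
enabled (the default in 2.14), LNet should spread bulk traffic across
the two rails.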

regards,
Jongwoo Han

On Tue, Oct 19, 2021 at 7:31 PM, 홍재기 via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

> Hi, all
>
>
>
> I am setting up Lustre on a local cluster based on an InfiniBand HDR
> (200Gb, single port) network.
>
> I could successfully set up Lustre using 5 servers (1 MDS, 4 OSS).
>
>
>
> Even though I have verified the IB HDR bandwidth (200Gb/s) with the
> 'ib_read_bw' and 'ib_write_bw' tools (I used CPU #0 for the test),
> when I run LNet selftest between servers it only shows around
> 100Gb/s (around 12GB/s, just half of the maximum bandwidth)
>
> (~12GB/s for the read test, ~13GB/s for the write test)
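>
> For reference, a sketch of the kind of command used for the raw check
> (the mlx5_0 device name and the server address are illustrative):
>
> # server side, pinned to the first NUMA node (CPU #0)
> numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -a --report_gbits
> # client side
> numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -a --report_gbits <server-ip>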
>
>
>
> So I tried changing the LNet tunables and pinned the CPT to "[0]" for IB
> with the following kernel module options,
>
> but it doesn't make a big difference in the LNet selftest.
>
>
>
> It seems like LNet is not fully compatible with HDR or PCIe gen4
> interfaces.
>
> Can anyone advise me on why the LNet performance is not reaching the
> HDR bandwidth?
>
> Or are there specific options or tunables that I have to modify?
>
> Please share your experience if you have set up Lustre with an HDR network.
>
>
>
> Thank you.
>
>
>
> -----------lustre.conf---------------
>
> options lnet networks=o2ib0(ib0)[0]
>
>
>
> -----------ko2iblnd.conf-------------
>
> options ko2iblnd peer_credits=256 peer_credits_hiw=64 credits=1024
> concurrent_sends=256 ntx=2048 map_on_demand=0 fmr_pool_size=2048
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=1
>
>
>
> ----------lnet tunables--------------
>
> tunables:
>               peer_timeout: 180
>               peer_credits: 255
>               peer_buffer_credits: 0
>               credits: 1024
>               peercredits_hiw: 127
>               map_on_demand: 0
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 1
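>
> (The above can be re-checked at runtime with lnetctl; a quick look,
> assuming LNet is already configured:)
>
> lnetctl net show -v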
>
>
>
> I have listed below the HW/SW environment that I used.
>
>
>
> ------------------------------------------------------------
>
> [Environment]
>
> - CPUs: EPYC 7302 x 2 sockets, supports PCIe Gen4
>
> - OS: CentOS 8.3 (kernel: 4.18.0-240.1.1.el8_lustre.x86_64)
>
> - Lustre: 2.14.0 (downloaded from repository
> https://downloads.whamcloud.com/public/lustre/lustre-2.14.0-ib/ )
>
> - OFED driver: tried 2 different versions MLNX_OFED_LINUX-5.2-1.0.4.0,
> MLNX_OFED_LINUX-5.4-1.0.3.0
>
>
>
> Finally, I used the following LNet selftest script for the test.
>
> I tried changing the concurrency, but the bandwidth saturates once CN >= 4.
>
>
>
>
>
> ----------------------------------------------------------
>
> # Concurrency
> CN=32
> #Size
> SZ=1M
> # Length of time to run test (secs)
> TM=20
> # Which BRW test to run (read or write)
> BRW=read
> # Checksum calculation (simple or full)
> CKSUM=simple
>
> # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
> LFROM="192.168.8.4@o2ib0"
> #LFROM=${LFROM:?ERROR: the LFROM variable is not set}
> # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
> LTO="192.168.8.6@o2ib0"
> #LTO=${LTO:?ERROR: the LTO variable is not set}
>
> ### End of customisation.
>
> export LST_SESSION=$$
> echo LST_SESSION = ${LST_SESSION}
> lst new_session lst${BRW}
> lst add_group lfrom ${LFROM}
> lst add_group lto ${LTO}
> lst add_batch bulk_${BRW}
> lst add_test --batch bulk_${BRW} --distribute 3:1 --from lfrom --to lto \
>   brw ${BRW} --concurrency=${CN} check=${CKSUM} size=${SZ}
> lst run bulk_${BRW}
> echo -n "Capturing statistics for ${TM} secs "
> lst stat --mbs lfrom lto &
> LSTPID=$!
> # Delay loop with interval markers displayed every 5 secs.
> # Test time is rounded up to the nearest 5 seconds.
> i=1
> j=$((${TM}/5))
> if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
> while [ $i -le $j ]; do
>   sleep 5
>   let i++
> done
> kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
> echo
> lst show_error lfrom lto
> lst stop bulk_${BRW}
> lst end_session
>
>
>
>
>
>


-- 
Jongwoo Han
+82-505-227-6108