Hi everyone,
I hope this message finds you well.
I am unable to get Lustre RoCE v2 traffic to carry specific DSCP/TOS tags. While synthetic tests (ib_send_bw) successfully hit the desired hardware priority queues by selecting a specific GID index, Lustre traffic remains stuck at tos 0x1 (ECN enabled, DSCP 0), causing it to be mapped to the default Unicast queue (UC0) rather than the Lossless queue (UC3) on our SONiC switches.
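For reference, the TOS values quoted in this message can be decoded with simple bit arithmetic: the upper six bits of the TOS byte are the DSCP and the lower two bits are the ECN field. The following is only a minimal Python sketch of that arithmetic; the helper names split_tos and make_tos are hypothetical and not part of any Lustre or RDMA tool.

```python
def split_tos(tos):
    """Split an 8-bit IP TOS / Traffic Class value into (dscp, ecn)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    # DSCP is the upper 6 bits, ECN the lower 2 bits.
    return tos >> 2, tos & 0x3


def make_tos(dscp, ecn=0):
    """Build a TOS byte from a DSCP (0-63) and an ECN field (0-3)."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("DSCP must be 0-63 and ECN 0-3")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    # Values taken from this message: observed, passed to cma_roce_tos, desired.
    for label, tos in (("observed", 0x01), ("cma_roce_tos -t", 104), ("desired", 0x69)):
        dscp, ecn = split_tos(tos)
        print(f"{label}: tos=0x{tos:02x} dscp={dscp} ecn={ecn}")
```

With these definitions, 104 (0x68) is DSCP 26 with ECN 0, and 0x69 is the same DSCP 26 with the ECN field set to 1.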
System environment:
OS: Rocky Linux 9.6 (5.14.0-570.17.1.el9_6.x86_64)
Lustre: 2.15.7
NIC:
    driver: bnxt_en
    version: 1.10.3-233.0.198.0
    firmware-version: 233.0.196.0/pkg 23.31.18.10
    expansion-rom-version:
    bus-info: 0000:21:00.1
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: yes
Switch:
    Software Version  : 4.5.0a-Enterprise_Premium
    Product           : Enterprise SONiC Distribution by Dell Technologies
    Distribution      : Debian 11.11
    Kernel            : 5.10.0-21-amd64
    Config DB Version : version_4_5_2
lnet:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7]"
    - net type: o2ib1
      local NI(s):
        - nid: 172.16.7.13@o2ib1
          status: up
          interfaces:
              0: ens1f1np1
          statistics:
              send_count: 2314
              recv_count: 4361
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 1
              concurrent_sends: 128
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 2
          CPT: "[0,1,2,3,4,5,6,7]"
Troubleshooting steps already taken:
Manual TOS Overwrite: Attempted cma_roce_tos -d bnxt_re1 -t 104. Command returns successfully, but tcpdump confirms outgoing Lustre packets still carry tos 0x1.
Kernel Mangle Bypass: Applied nftables (mangle table) rules to force DSCP 26 on UDP port 4791. Traffic remains 0x1, suggesting hardware offload bypasses the Linux network stack.
Synthetic Success: Using ib_send_bw -x 3 (selecting GID Index 3) successfully changes the hardware queue and tagging. This proves the hardware is capable, but the Lustre kernel module isn't utilizing the correct GID index or TOS.
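To see what GID index 3 corresponds to on this host, the GID table can be read from the standard RDMA sysfs tree (/sys/class/infiniband/<dev>/ports/<port>/gids and gid_attrs). The following is only a rough Python sketch: the device name bnxt_re1 comes from the cma_roce_tos command above, port 1 is an assumption, and read_gid_table is a hypothetical helper name.

```python
from pathlib import Path

SYSFS_IB = "/sys/class/infiniband"  # standard RDMA sysfs root


def _read_attr(path):
    """Return the stripped file content, or "" if the entry cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Unpopulated GID slots typically fail to read; treat them as empty.
        return ""


def read_gid_table(dev, port=1, sysfs_root=SYSFS_IB):
    """List populated GID entries as (index, gid, type, netdev) tuples."""
    base = Path(sysfs_root) / dev / "ports" / str(port)
    indices = sorted(
        int(p.name) for p in (base / "gids").iterdir() if p.name.isdigit()
    )
    entries = []
    for idx in indices:
        gid = _read_attr(base / "gids" / str(idx))
        # Skip empty slots (all-zero GIDs).
        if not gid or set(gid) <= {"0", ":"}:
            continue
        gid_type = _read_attr(base / "gid_attrs" / "types" / str(idx))
        ndev = _read_attr(base / "gid_attrs" / "ndevs" / str(idx))
        entries.append((idx, gid, gid_type, ndev))
    return entries


if __name__ == "__main__":
    # "bnxt_re1" is the device named in this thread; port 1 is an assumption.
    for idx, gid, gid_type, ndev in read_gid_table("bnxt_re1"):
        print(f"{idx:3d}  {gid}  {gid_type:<12}  {ndev}")
```

Comparing the type and address of index 3 with the other populated entries may show which entry the Lustre connections end up using.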
Based on all of the above, I believe Lustre (LNet and/or ko2iblnd) is not tagging its packets, and I cannot find a way to make it use ToS 0x69 (DSCP 26 plus the ECN bit that is already set). On the switch, Lustre traffic still lands in UC0, whereas traffic from ib_send_bw -x 3 goes through UC3, which is why I suspect Lustre rather than the NIC or the switch.
I would appreciate it if someone could give me some guidance to solve this problem.
Thank you in advance.
Warm regards,
Victor