[lustre-discuss] Lustre RoCE v2 traffic issue

VICTOR MANUEL MINJARES NERIZ victor.minjares at unison.mx
Tue May 5 15:17:40 PDT 2026


Hello Wangshuo,
I apologize if I made a mistake with your name.
I appreciate the response. However, I am hesitant to modify and recompile the drivers: other programs run on these systems, and I do not know how the change would affect them. I would like to try other options before recompiling.
Warm regards,
Victor
________________________________
From: wangshuobin <w14767780617 at 163.com>
Sent: Thursday, April 30, 2026 12:15 AM
To: VICTOR MANUEL MINJARES NERIZ <victor.minjares at unison.mx>
Subject: Re: [lustre-discuss] Lustre RoCE v2 traffic issue

hello,


I’ve noticed a similar issue:

On the server side I’m using an MLX NIC, and on the client side an Intel E810 NIC. When the RDMA connection is initiated by the E810 NIC, even if the priority is set to 3, the traffic still goes through priority 0.

However, in the opposite case, when the MLX NIC initiates the RDMA connection, the traffic uses the configured priority as expected.

You can try modifying the kernel source file cma.c (drivers/infiniband/core/cma.c) so that the num_tc branch of get_vlan_ndev_tc() is commented out, as in the following code:

static int get_vlan_ndev_tc(struct net_device *vlan_ndev, int prio)
{
        struct net_device *dev;
        int tc;

        dev = vlan_dev_real_dev(vlan_ndev);
        pr_info("cma: get_vlan_ndev_tc: vlan_ndev=%s, real_dev=%s, prio=%d, num_tc=%d\n",
                vlan_ndev->name, dev->name, prio, dev->num_tc);

        //if (dev->num_tc) {
        //      tc = netdev_get_prio_tc_map(dev, prio);
        //      pr_info("cma: get_vlan_ndev_tc: path=netdev_get_prio_tc_map, tc=%d\n", tc);
        //      return tc;
        //}

        tc = (vlan_dev_get_egress_qos_mask(vlan_ndev, prio) &
              VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
        pr_info("cma: new get_vlan_ndev_tc: path=vlan_egress_qos, egress_mask=0x%x, tc=%d\n",
                vlan_dev_get_egress_qos_mask(vlan_ndev, prio), tc);
        return tc;
}



Then recompile and reload it on the client where the E810 NIC is installed and test again.

My NIC is an E810. Since you’re using a Broadcom NIC, it might be a similar issue, so you can try this as a reference.



The root cause is that the priority used by RDMA is determined by the CM connection request.

When the E810 NIC initiates an RDMA connection, its dev->num_tc default value is 1 (instead of 0). As a result, in the get_vlan_ndev_tc() function, the code follows the netdev_get_prio_tc_map branch. This ultimately causes the PRIMARY_SL field in the transmitted CM REQ packet to be set to 0.

The MLX responder then uses this value to set the priority, which results in the traffic going through priority 0.
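
The two code paths can be modeled in user space to see why a non-zero num_tc hides the VLAN egress map. What follows is only a minimal sketch under stated assumptions, not kernel code: the model_* structs and the skip_num_tc_branch flag are hypothetical stand-ins for struct net_device, the VLAN egress QoS map, and the commented-out branch. The real egress map is a hash keyed by priority; a flat array is used here for simplicity. The VLAN_PRIO_MASK, VLAN_PRIO_SHIFT and TC_BITMASK values match the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_PRIO_MASK  0xe000u /* PCP bits of the 802.1Q TCI */
#define VLAN_PRIO_SHIFT 13
#define TC_BITMASK      15

/* Hypothetical stand-in for the real (physical) net_device. */
struct model_real_dev {
    unsigned int num_tc;                 /* 0 = no traffic classes configured */
    uint8_t prio_tc_map[TC_BITMASK + 1]; /* priority -> traffic class */
};

/* Hypothetical stand-in for the VLAN device stacked on top of it. */
struct model_vlan_dev {
    const struct model_real_dev *real;
    uint16_t egress_qos[TC_BITMASK + 1]; /* PCP already shifted into TCI position */
};

/*
 * Mirrors the branch structure of get_vlan_ndev_tc().
 * skip_num_tc_branch != 0 models the patched function, where the
 * num_tc branch is commented out.
 */
int model_get_vlan_ndev_tc(const struct model_vlan_dev *vlan, int prio,
                           int skip_num_tc_branch)
{
    assert(vlan != NULL && vlan->real != NULL);

    /* Path 1: the real device reports traffic classes, use its prio->TC map. */
    if (!skip_num_tc_branch && vlan->real->num_tc)
        return vlan->real->prio_tc_map[prio & TC_BITMASK];

    /* Path 2: fall back to the VLAN egress QoS map. */
    return (vlan->egress_qos[prio & TC_BITMASK] & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
}
```

In this model, a device with num_tc = 1 and an all-zero prio_tc_map returns 0 for priority 3 on the first path, while the second path returns 3 when the egress map entry for priority 3 carries PCP 3.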





At 2026-04-30 05:52:26, "VICTOR MANUEL MINJARES NERIZ via lustre-discuss" <lustre-discuss at lists.lustre.org> wrote:

Hi everyone,

I hope this message finds you well.

I am unable to get Lustre RoCE v2 traffic to carry specific DSCP/TOS tags. While synthetic tests (ib_send_bw) successfully hit the desired hardware priority queues by selecting a specific GID index, Lustre traffic remains stuck at tos 0x1 (ECN enabled, DSCP 0), causing it to be mapped to the default Unicast queue (UC0) rather than the Lossless queue (UC3) on our SONiC switches.
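
For reference, the TOS byte splits into a 6-bit DSCP field (upper bits) and a 2-bit ECN field (lower bits). The sketch below decodes it; decode_tos and struct tos_fields are hypothetical names used only for illustration. Under this layout, tos 0x1 decodes to DSCP 0 with ECN codepoint ECT(1), which is consistent with a DSCP-based classifier placing the traffic in the default queue.

```c
#include <stdint.h>

/* ECN codepoints from RFC 3168 (lower two bits of the TOS byte). */
enum ecn_codepoint {
    ECN_NOT_ECT = 0,
    ECN_ECT1    = 1,
    ECN_ECT0    = 2,
    ECN_CE      = 3
};

struct tos_fields {
    unsigned int dscp;      /* 0..63 */
    enum ecn_codepoint ecn;
};

/* Split a TOS / Traffic Class byte into its DSCP and ECN parts. */
struct tos_fields decode_tos(uint8_t tos)
{
    struct tos_fields f;

    f.dscp = tos >> 2;
    f.ecn = (enum ecn_codepoint)(tos & 0x3u);
    return f;
}
```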

System environment:

OS: Rocky Linux 9.6 (5.14.0-570.17.1.el9_6.x86_64)

Lustre: 2.15.7

NIC:
driver: bnxt_en
version: 1.10.3-233.0.198.0
firmware-version: 233.0.196.0/pkg 23.31.18.10
expansion-rom-version:
bus-info: 0000:21:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Switch:
Software Version  : 4.5.0a-Enterprise_Premium
Product           : Enterprise SONiC Distribution by Dell Technologies
Distribution      : Debian 11.11
Kernel            : 5.10.0-21-amd64
Config DB Version : version_4_5_2

lnet:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7]"
    - net type: o2ib1
      local NI(s):
        - nid: 172.16.7.13@o2ib1
          status: up
          interfaces:
              0: ens1f1np1
          statistics:
              send_count: 2314
              recv_count: 4361
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 1
              concurrent_sends: 128
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 2
          CPT: "[0,1,2,3,4,5,6,7]"

Troubleshooting Steps Already Taken

Manual TOS Overwrite: Attempted cma_roce_tos -d bnxt_re1 -t 104. Command returns successfully, but tcpdump confirms outgoing Lustre packets still carry tos 0x1.

Kernel Mangle Bypass: Applied nftables (mangle table) rules to force DSCP 26 on UDP port 4791. Traffic remains 0x1, suggesting hardware offload bypasses the Linux network stack.

Synthetic Success: Using ib_send_bw -x 3 (selecting GID Index 3) successfully changes the hardware queue and tagging. This proves the hardware is capable, but the Lustre kernel module isn't utilizing the correct GID index or TOS.

Based on this investigation, I believe Lustre (LNet and/or ko2iblnd) is not tagging its packets correctly, and I cannot find a way to make it use ToS 0x69. On the switch, the traffic still lands in UC0. I suspect Lustre because traffic generated with ib_send_bw -x 3 does go through UC3.
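
The values quoted in this message are consistent with one another: the 104 passed to cma_roce_tos and the 0x69 target both carry DSCP 26 and differ only in the ECN bits. A minimal C sketch of that arithmetic follows; make_tos is a hypothetical helper name, not part of any tool mentioned here.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a 6-bit DSCP and a 2-bit ECN codepoint into one TOS byte. */
uint8_t make_tos(unsigned int dscp, unsigned int ecn)
{
    assert(dscp <= 63 && ecn <= 3);
    return (uint8_t)((dscp << 2) | ecn);
}
```

With these definitions, make_tos(26, 0) gives 104 (0x68) and make_tos(26, 1) gives 105 (0x69).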

I would appreciate it if someone could give me some guidance to solve this problem.

Thank you in advance.

Warm regards,
Victor
