[lustre-discuss] Client Eviction and EIO Errors During Simulated Network Flapping (Lustre 2.15.5 + RoCE)
zufei chen
chenzufei at gmail.com
Wed May 28 23:18:01 PDT 2025
I. Background:
1 Four physical nodes; each physical machine hosts 2 virtual machines:
lustre-mds-nodexx (containing 2 MDTs) and lustre-oss-nodexx
(containing 8 OSTs, with the MGS on one of them).
2 The two RoCE network interfaces on each physical machine, ens6f0np0 and
ens6f1np1, are virtualized and passed through to the virtual machines
as service1 and service2.
3 Using Lustre version 2.15.5 with Pacemaker.
4 A client is running vdbench workloads.
5 Simulating network interface flapping on ens6f0np0 on one of the physical
nodes using the following script:
for i in {1..10}; do
    ifconfig ens6f0np0 down
    sleep 20
    ifconfig ens6f0np0 up
    sleep 30
done
II. Problem:
1 After the network flapping script has been running for a while, the
workload hits EIO errors, leading to a service interruption.
2 The issue is reproducible almost every time.
III. Preliminary Analysis:
The issue is suspected to be caused by lock callback timeouts, which cause
the server to evict the client: the server log below shows the lock
callback timer expiring after 268s, after which the client reports losing
its connection to PFStest-OST0005.
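For reference, the server-side timeout tunables most relevant to this
callback timer can be inspected as sketched below (a minimal sketch, run on
the OSS; ldlm_enqueue_min corresponds to the modprobe option listed in
section V):

# Sketch: print the obd timeout, adaptive-timeout bounds, and the
# minimum ldlm enqueue timeout currently in effect on the OSS
lctl get_param timeout at_min at_max
cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min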
IV. Relevant Logs:
Server:
May 27 12:09:19 lustre-oss-node40 kernel: LustreError:
13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer
expired after 268s: evicting client at 10.255.153.118@o2ib ns:
filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb
lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020
nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid:
14457 timeout: 60814 lvb_type: 1
Client:
May 27 12:09:27 rocky9vm2 kernel: Lustre:
PFStest-OST0005-osc-ff49d5028d989800: Connection to PFStest-OST0005 (at
10.255.153.242@o2ib) was lost; in-progress operations using this service
will wait for recovery to complete.
V. Additional Information
1 IP Configuration in Virtual Machines:
| Virtual Machine | Service | IP Address |
| ----------------- | -------- | -------------- |
| lustre-mds-node32 | service1 | 10.255.153.236 |
| | service2 | 10.255.153.237 |
| lustre-oss-node32 | service1 | 10.255.153.238 |
| | service2 | 10.255.153.239 |
| lustre-mds-node40 | service1 | 10.255.153.240 |
| | service2 | 10.255.153.241 |
| lustre-oss-node40 | service1 | 10.255.153.242 |
| | service2 | 10.255.153.243 |
| lustre-mds-node41 | service1 | 10.255.153.244 |
| | service2 | 10.255.153.245 |
| lustre-oss-node41 | service1 | 10.255.153.246 |
| | service2 | 10.255.153.247 |
| lustre-mds-node42 | service1 | 10.255.153.248 |
| | service2 | 10.255.153.249 |
| lustre-oss-node42 | service1 | 10.255.153.250 |
| | service2 | 10.255.153.251 |
2 Policy Routing Configuration on Server (Example: lustre-oss-node40):
cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
263 service1
271 service2
[root@lustre-oss-node40 ~]# ip route show table service1
10.255.153.0/24 dev service1 scope link src 10.255.153.242
[root@lustre-oss-node40 ~]# ip route show table service2
10.255.153.0/24 dev service2 scope link src 10.255.153.243
[root@lustre-oss-node40 ~]# ip rule list
0: from all lookup local
32764: from 10.255.153.243 lookup service2
32765: from 10.255.153.242 lookup service1
32766: from all lookup main
32767: from all lookup default
[root@lustre-oss-node40 ~]# ip route
10.255.153.0/24 dev service2 proto kernel scope link src 10.255.153.243
metric 101
10.255.153.0/24 dev service1 proto kernel scope link src 10.255.153.242
metric 102
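For reference, rules equivalent to the above can be created with iproute2
commands along these lines (a sketch only, using the table names and
addresses shown above; the actual provisioning may differ):

ip route add 10.255.153.0/24 dev service1 scope link src 10.255.153.242 table service1
ip route add 10.255.153.0/24 dev service2 scope link src 10.255.153.243 table service2
ip rule add from 10.255.153.242 lookup service1 pref 32765
ip rule add from 10.255.153.243 lookup service2 pref 32764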
3 /etc/modprobe.d/lustre.conf:
options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"
options libcfs cpu_npartitions=2
options mdt max_mod_rpcs_per_client=128
options mdt mds_io_num_cpts=[0,1]
options mdt mds_num_cpts=[0,1]
options mdt mds_rdpg_num_cpts=[0,1]
options mds mds_num_threads=512
options ost oss_num_threads=512
options ost oss_cpts=[0,1]
options ost oss_io_cpts=[0,1]
options lnet portal_rotor=1
options lnet lnet_recovery_limit=10
options ptlrpc ldlm_enqueue_min=260
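With this configuration, the resulting LNet multi-rail state can be checked
with lnetctl (a minimal sketch; both o2ib NIs, on service1 and service2,
should be listed):

lnetctl net show
lnetctl net show -v    # verbose output with per-NI health and statistics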
VI. Other Attempts
1 Reduced LNet Timeout and Increased Retry Count:
The LNet transaction timeout was reduced and the retry count increased on
both the server and the client, but the issue persists:
lnetctl set transaction_timeout 10
lnetctl set retry_count 3
lnetctl set health_sensitivity 1
2 Set Recovery Limit:
The recovery limit was set on both the server and the client, but the issue
persists (the resulting global settings can be confirmed as sketched below).
lnetctl set recovery_limit 10
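A minimal sketch for confirming the global LNet settings set above on each
node:

lnetctl global show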
3 Simulated Network Flapping Using iptables:
Network flapping was also simulated with iptables inside the virtual
machines, but the issue persists:
#!/bin/bash
for j in {1..1000}; do
    date
    echo -e "\nIteration $j: Starting single-port network flapping\n"
    for i in {1..10}; do
        echo -e " ==== Iteration $i down ===="; date
        sudo iptables -I INPUT 1 -i service1 -j DROP
        sudo iptables -I OUTPUT 1 -o service1 -j DROP
        sleep 20
        echo -e " ==== Iteration $i up ===="; date
        sudo iptables -D INPUT -i service1 -j DROP
        sudo iptables -D OUTPUT -o service1 -j DROP
        sleep 30
    done
    echo -e "\nIteration $j: Ending single-port network flapping\n"; date
    sudo iptables -L INPUT -v | grep -i service1
    sudo iptables -L OUTPUT -v | grep -i service1
    sleep 120
done
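To correlate the EIO errors with evictions while these tests run, the
client-side import state for the affected OST can be watched; a minimal
sketch (the target name is taken from the client log above):

# Sketch: show the client's connection state to OST0005 during the test
lctl get_param osc.*OST0005*.import | grep -E 'target|state'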
VII. Any Suggestions?
Dear all, I would appreciate any suggestions or insights you might have
regarding this issue. Thank you!