[lustre-discuss] Client Eviction and EIO Errors During Simulated Network Flapping (Lustre 2.15.5 + RoCE)

Andreas Dilger adilger at ddn.com
Fri May 30 23:28:39 PDT 2025


Continuous network failure is a very challenging environment for a network filesystem.  Even though the server resends lock callbacks, eventually the client will miss two or three callbacks in a row, and the server has no choice but to evict it from the filesystem if it wants to make progress on other clients' requests.

This can also cause problems for other clients, since they are waiting for a lock that the broken client is holding, which makes the whole filesystem "hang" until that client finally processes the callback, or is evicted.

We've discussed a few potential solutions for this, but nothing has been implemented yet:
- put clients with continual network errors into the "dog house", barring them from the filesystem until their network is repaired; this is drastic for that client, though it improves life for the other clients
- switch clients with continual network errors from writeback cache to cacheless/lockless/sync operation, which hurts their performance but still lets them access the filesystem without impacting other clients.
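In the meantime, a client in this situation can be spotted from its import state (FULL, DISCONN, EVICTED, ...) as reported by `lctl get_param osc.*.import`. A minimal sketch that pulls the per-target state out of that output; the `target:` and `state:` fields match the import YAML, but treat the exact layout (and the sample text) as an assumption:

```python
import re

def import_states(text):
    """Map each import's target UUID to its connection state."""
    states = {}
    target = None
    for line in text.splitlines():
        m = re.match(r"\s*target:\s*(\S+)", line)
        if m:
            target = m.group(1)
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and target:
            states[target] = m.group(1)
            target = None
    return states

# Illustrative sample in the shape of `lctl get_param osc.*.import` output.
sample = """\
import:
    name: PFStest-OST0005-osc-ff49d5028d989800
    target: PFStest-OST0005_UUID
    state: EVICTED
import:
    name: PFStest-OST0000-osc-ff49d5028d989800
    target: PFStest-OST0000_UUID
    state: FULL
"""
print(import_states(sample))
# {'PFStest-OST0005_UUID': 'EVICTED', 'PFStest-OST0000_UUID': 'FULL'}
```

Polling this during the flapping window would show exactly when the OST import drops out of FULL.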

Cheers, Andreas

On May 29, 2025, at 00:19, zufei chen via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:



I. Background:
1. Four physical nodes; each physical machine hosts 2 virtual machines: lustre-mds-nodexx (containing 2 MDTs) and lustre-oss-nodexx (containing 8 OSTs, with the MGS on one of them).
2. Two RoCE network interfaces on the physical machines, ens6f0np0 and ens6f1np1, are virtualized and passed through to the virtual machines (as service1 and service2).
3. Lustre version 2.15.5 with Pacemaker.
4. A client is running vdbench workloads.
5. Network interface flapping is simulated on ens6f0np0 on one of the physical nodes with the following script:
for i in {1..10}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; sleep 30; done

II. Problem:
1. After the network flapping script has run for a while, the workload hits EIO errors, causing a service interruption.
2. The issue reproduces almost every time.

III. Preliminary Analysis:
The issue is suspected to be caused by lock callback timeouts, which lead to the server evicting the client.

IV. Relevant Logs:
Server:
May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 268s: evicting client at 10.255.153.118@o2ib  ns: filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb
        lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid: 14457 timeout: 60814 lvb_type: 1
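For correlating evictions with the flapping windows, the eviction events can be pulled out of the server logs. A small sketch; the regex is an assumption derived from the message above (whose 268s timer is in the neighborhood of the configured ldlm_enqueue_min=260 plus margin):

```python
import re

# Assumed pattern, based on the "lock callback timer expired" line above.
EVICT_RE = re.compile(
    r"lock callback timer expired after (\d+)s: evicting client at (\S+)"
)

def evictions(log_lines):
    """Yield (timeout_seconds, client_nid) for each eviction event."""
    for line in log_lines:
        m = EVICT_RE.search(line)
        if m:
            yield int(m.group(1)), m.group(2)

log = [
    "May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 13958:0:"
    "(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer "
    "expired after 268s: evicting client at 10.255.153.118@o2ib  ns: ...",
]
print(list(evictions(log)))  # [(268, '10.255.153.118@o2ib')]
```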

Client:
May 27 12:09:27 rocky9vm2 kernel: Lustre: PFStest-OST0005-osc-ff49d5028d989800: Connection to PFStest-OST0005 (at 10.255.153.242@o2ib) was lost; in-progress operations using this service will wait for recovery to complete.

V. Additional Information
1. IP Configuration in the Virtual Machines:
| Virtual Machine   | Service  | IP Address     |
| ----------------- | -------- | -------------- |
| lustre-mds-node32 | service1 | 10.255.153.236 |
|                   | service2 | 10.255.153.237 |
| lustre-oss-node32 | service1 | 10.255.153.238 |
|                   | service2 | 10.255.153.239 |
| lustre-mds-node40 | service1 | 10.255.153.240 |
|                   | service2 | 10.255.153.241 |
| lustre-oss-node40 | service1 | 10.255.153.242 |
|                   | service2 | 10.255.153.243 |
| lustre-mds-node41 | service1 | 10.255.153.244 |
|                   | service2 | 10.255.153.245 |
| lustre-oss-node41 | service1 | 10.255.153.246 |
|                   | service2 | 10.255.153.247 |
| lustre-mds-node42 | service1 | 10.255.153.248 |
|                   | service2 | 10.255.153.249 |
| lustre-oss-node42 | service1 | 10.255.153.250 |
|                   | service2 | 10.255.153.251 |

2. Policy Routing Configuration on the Servers (example: lustre-oss-node40):

cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep
263     service1
271     service2

[root@lustre-oss-node40 ~]# ip route show table service1
10.255.153.0/24 dev service1 scope link src 10.255.153.242
[root@lustre-oss-node40 ~]# ip route show table service2
10.255.153.0/24 dev service2 scope link src 10.255.153.243
[root@lustre-oss-node40 ~]# ip rule list
0:      from all lookup local
32764:  from 10.255.153.243 lookup service2
32765:  from 10.255.153.242 lookup service1
32766:  from all lookup main
32767:  from all lookup default
[root@lustre-oss-node40 ~]# ip route
10.255.153.0/24 dev service2 proto kernel scope link src 10.255.153.243 metric 101
10.255.153.0/24 dev service1 proto kernel scope link src 10.255.153.242 metric 102

3. /etc/modprobe.d/lustre.conf:
options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"
options libcfs cpu_npartitions=2
options mdt max_mod_rpcs_per_client=128
options mdt mds_io_num_cpts=[0,1]
options mdt mds_num_cpts=[0,1]
options mdt mds_rdpg_num_cpts=[0,1]
options mds mds_num_threads=512
options ost oss_num_threads=512
options ost oss_cpts=[0,1]
options ost oss_io_cpts=[0,1]
options lnet portal_rotor=1
options lnet lnet_recovery_limit=10
options ptlrpc ldlm_enqueue_min=260

VI. Other Attempts
1. Reduced the LNet Timeout and Increased the Retry Count:
Both servers and clients have the reduced LNet timeout and increased retry count applied, but the issue persists.
lnetctl set transaction_timeout 10
lnetctl set retry_count 3
lnetctl set health_sensitivity 1
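Note that LNet splits the transaction timeout across the retries: each send attempt gets roughly transaction_timeout / (retry_count + 1) seconds, and all retries must complete within the transaction timeout. A back-of-the-envelope sketch (assuming those semantics) of why these settings alone cannot ride out a 20-second outage; the message must instead fail over to the healthy interface or fail upward:

```python
transaction_timeout = 10  # s, from `lnetctl set transaction_timeout 10`
retry_count = 3           # from `lnetctl set retry_count 3`

# Each attempt (original send + retries) gets an equal slice of the window.
per_attempt = transaction_timeout / (retry_count + 1)
print(f"per-attempt timeout: {per_attempt}s")  # 2.5s

outage = 20  # s the interface stays down in the flapping script
# The entire retry budget falls inside the outage, so the send cannot
# succeed on the flapped interface no matter how the retries are spaced.
print(outage > transaction_timeout)  # True
```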

2. Set the Recovery Limit:
Both servers and clients have the recovery limit set, but the issue persists.
lnetctl set recovery_limit 10

3. Simulated Network Flapping Using iptables:
Network flapping was also simulated with iptables inside the virtual machines, but the issue persists.
#!/bin/bash
for j in {1..1000}; do
    date;
    echo -e "\nIteration $j: Starting single-port network flapping\n";
    for i in {1..10}; do
        echo -e " ==== Iteration $i down ===="; date;
        sudo iptables -I INPUT 1 -i service1 -j DROP;
        sudo iptables -I OUTPUT 1 -o service1 -j DROP;
        sleep 20;
        echo -e " ==== Iteration $i up ===="; date;
        sudo iptables -D INPUT -i service1 -j DROP;
        sudo iptables -D OUTPUT -o service1 -j DROP;
        sleep 30s;
    done
    echo -e "\nIteration $j: Ending single-port network flapping\n"; date;
    sudo iptables -L INPUT -v | grep -i service1
    sudo iptables -L OUTPUT -v | grep -i service1
    sleep 120;
done

VII. Any Suggestions?
Dear all, I would appreciate any suggestions or insights you might have regarding this issue. Thank you!
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org