<div dir="ltr"><br><div>I. Background:<br>1 Four physical nodes. Each physical machine hosts 2 virtual machines: lustre-mds-nodexx (with 2 MDTs) and lustre-oss-nodexx (with 8 OSTs, plus the MGS on one of them).<br>2 Two RoCE network interfaces on each physical machine, ens6f0np0 and ens6f1np1, are virtualized and passed through to the virtual machines, where they appear as service1 and service2.<br>3 Lustre version 2.15.5 with Pacemaker.<br>4 A client is running vdbench workloads.<br>5 Network interface flapping is simulated on ens6f0np0 on one of the physical nodes with the following script:<br>for i in {1..10}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; sleep 30; done<br><br>II. Problem:<br>1 After the network flapping script has run for a while, the client workload fails with EIO errors, interrupting service.<br>2 The issue reproduces almost every time.<br><br>III. Preliminary Analysis:<br>We suspect the cause is lock callback timeouts, which lead the server to evict the client.<br><br>IV. Relevant Logs:<br>Server:<br>May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 268s: evicting client at 10.255.153.118@o2ib  ns: filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid: 14457 timeout: 60814 lvb_type: 1<br><br>Client:<br>May 27 12:09:27 rocky9vm2 kernel: Lustre: PFStest-OST0005-osc-ff49d5028d989800: Connection to PFStest-OST0005 (at 10.255.153.242@o2ib) was lost; in-progress operations using this service will wait for recovery to complete.<br><br>V. 
Additional Information<br>1 IP Configuration in Virtual Machines:<br>| Virtual Machine   | Service  | IP Address     |<br>| ----------------- | -------- | -------------- |<br>| lustre-mds-node32 | service1 | 10.255.153.236 |<br>|                   | service2 | 10.255.153.237 |<br>| lustre-oss-node32 | service1 | 10.255.153.238 |<br>|                   | service2 | 10.255.153.239 |<br>| lustre-mds-node40 | service1 | 10.255.153.240 |<br>|                   | service2 | 10.255.153.241 |<br>| lustre-oss-node40 | service1 | 10.255.153.242 |<br>|                   | service2 | 10.255.153.243 |<br>| lustre-mds-node41 | service1 | 10.255.153.244 |<br>|                   | service2 | 10.255.153.245 |<br>| lustre-oss-node41 | service1 | 10.255.153.246 |<br>|                   | service2 | 10.255.153.247 |<br>| lustre-mds-node42 | service1 | 10.255.153.248 |<br>|                   | service2 | 10.255.153.249 |<br>| lustre-oss-node42 | service1 | 10.255.153.250 |<br>|                   | service2 | 10.255.153.251 |<br><br>2 Policy Routing Configuration on Server (Example: lustre-oss-node40):<br><br>[root@lustre-oss-node40 ~]# cat /etc/iproute2/rt_tables<br>#<br># reserved values<br>#<br>255     local<br>254     main<br>253     default<br>0       unspec<br>#<br># local<br>#<br>#1      inr.ruhep<br>263     service1<br>271     service2<br><br>[root@lustre-oss-node40 ~]# ip route show table service1<br>10.255.153.0/24 dev service1 scope link src 10.255.153.242<br>[root@lustre-oss-node40 ~]# ip route show table service2<br>10.255.153.0/24 dev service2 scope link src 10.255.153.243<br>[root@lustre-oss-node40 ~]# ip rule list<br>0:      from all lookup local<br>32764:  from 10.255.153.243 lookup service2<br>32765:  from 10.255.153.242 lookup service1<br>32766:  from all lookup main<br>32767:  from all lookup default<br>[root@lustre-oss-node40 ~]# ip route<br>10.255.153.0/24 dev service2 proto kernel 
scope link src 10.255.153.243 metric 101<br>10.255.153.0/24 dev service1 proto kernel scope link src 10.255.153.242 metric 102<br><br>3 /etc/modprobe.d/lustre.conf:<br>options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"<br>options libcfs cpu_npartitions=2<br>options mdt max_mod_rpcs_per_client=128<br>options mdt mds_io_num_cpts=[0,1]<br>options mdt mds_num_cpts=[0,1]<br>options mdt mds_rdpg_num_cpts=[0,1]<br>options mds mds_num_threads=512<br>options ost oss_num_threads=512<br>options ost oss_cpts=[0,1]<br>options ost oss_io_cpts=[0,1]<br>options lnet portal_rotor=1<br>options lnet lnet_recovery_limit=10<br>options ptlrpc ldlm_enqueue_min=260<br><br>VI. Other Attempts<br>1 Reduced LNet Timeout and Increased Retry Count:<br>On both server and client, we lowered the LNet transaction timeout and raised the retry count, but the issue persists.<br>lnetctl set transaction_timeout 10<br>lnetctl set retry_count 3<br>lnetctl set health_sensitivity 1<br><br>2 Set Recovery Limit:<br>On both server and client, we set the recovery limit, but the issue persists.<br>lnetctl set recovery_limit 10<br><br>3 Simulated Network Flapping Using iptables:<br>We also simulated the flapping with iptables inside the virtual machines, instead of taking the physical interface down, and the issue still occurs.<br>#!/bin/bash<br>for j in {1..1000}; do<br>    date;<br>    echo -e "\nIteration $j: Starting single-port network flapping\n";<br>    for i in {1..10}; do<br>        echo -e " ==== Iteration $i down ===="; date;<br>        sudo iptables -I INPUT 1 -i service1 -j DROP;<br>        sudo iptables -I OUTPUT 1 -o service1 -j DROP;<br>        sleep 20;<br>        echo -e " ==== Iteration $i up ===="; date;<br>        sudo iptables -D INPUT -i service1 -j DROP;<br>        sudo iptables -D OUTPUT -o service1 -j DROP;<br>        sleep 30s;<br>    done<br>    echo -e "\nIteration $j: Ending single-port network flapping\n"; date;<br>    sudo iptables -L INPUT -v | grep -i service1<br>    sudo iptables -L OUTPUT -v | grep 
-i service1<br>    sleep 120;<br>done<br><br>VII. Any Suggestions?<br>Dear all, I would appreciate any suggestions or insights you might have regarding this issue. Thank you!</div></div>