<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><style>body { line-height: 1.5; }blockquote { margin-top: 0px; margin-bottom: 0px; margin-left: 0.5em; }div.FoxDiv20250602001603064702 { }body { font-size: 14px; font-family: "Microsoft YaHei UI"; color: rgb(0, 0, 0); line-height: 1.5; }</style></head><body>
<div><span></span><br></div><div><div class="paragraph" style="font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, -apple-system, "Segoe UI", Roboto, Ubuntu, Cantarell, "Noto Sans", sans-serif, Arial, "PingFang SC", "Source Han Sans SC", "Microsoft YaHei UI", "Microsoft YaHei", "Noto Sans CJK SC", sans-serif; margin: 0px 0px 16px; padding: 0px; border: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; line-height: 26px; font-optical-sizing: inherit; font-kerning: inherit; font-feature-settings: inherit; font-variation-settings: inherit; vertical-align: baseline; letter-spacing: 0px; max-width: 100%; white-space: pre-wrap; word-break: break-word; color: rgba(0, 0, 0, 0.9);">Thank you for your response.</div><div class="paragraph" style="font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, -apple-system, "Segoe UI", Roboto, Ubuntu, Cantarell, "Noto Sans", sans-serif, Arial, "PingFang SC", "Source Han Sans SC", "Microsoft YaHei UI", "Microsoft YaHei", "Noto Sans CJK SC", sans-serif; margin: 0px 0px 16px; padding: 0px; border: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; line-height: 26px; font-optical-sizing: inherit; font-kerning: inherit; font-feature-settings: inherit; font-variation-settings: inherit; vertical-align: baseline; letter-spacing: 0px; max-width: 100%; white-space: pre-wrap; word-break: break-word; color: rgba(0, 0, 0, 0.9);">May I ask whether the retry for failed lock callback messages is handled by the lnet layer or the lock module itself? If it's the lnet layer that does the retrying, then theoretically, in the case of a <span style="background-color: transparent;">single </span><span style="letter-spacing: 0px; background-color: transparent;">network interface flap, the retry should eventually succeed. 
For the lock module, this would just mean some extra delay, right?</span></div><div class="paragraph" style="font-family: -apple-system, BlinkMacSystemFont, &quot;Segoe UI&quot;, system-ui, -apple-system, &quot;Segoe UI&quot;, Roboto, Ubuntu, Cantarell, &quot;Noto Sans&quot;, sans-serif, Arial, &quot;PingFang SC&quot;, &quot;Source Han Sans SC&quot;, &quot;Microsoft YaHei UI&quot;, &quot;Microsoft YaHei&quot;, &quot;Noto Sans CJK SC&quot;, sans-serif; margin: 0px; padding: 0px; border: 0px; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-stretch: inherit; font-size: 16px; line-height: 26px; font-optical-sizing: inherit; font-kerning: inherit; font-feature-settings: inherit; font-variation-settings: inherit; vertical-align: baseline; letter-spacing: 0px; max-width: 100%; white-space: pre-wrap; word-break: break-word; color: rgba(0, 0, 0, 0.9);">Also, what is your opinion on configuring the two network interfaces in bond mode 4 (802.3ad/LACP) to address the issue of a single network interface flapping?</div></div>
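In case it helps the discussion, here is a minimal sketch of what a mode 4 bond could look like with NetworkManager. Only the interface names (ens6f0np0/ens6f1np1) and the example IP come from this thread; the connection names are hypothetical, the switch side must be configured for LACP, and for RoCE traffic the NIC driver also has to support RoCE LAG (mlx5 does for two ports of the same adapter, as far as I know):

```
# Hypothetical sketch: 802.3ad (mode 4) bond of the two RoCE ports.
# Assumes LACP on the switch and RoCE-LAG support in the NIC driver.
nmcli con add type bond ifname bond0 con-name bond0 \
      bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"
nmcli con add type ethernet ifname ens6f0np0 con-name bond0-p0 master bond0
nmcli con add type ethernet ifname ens6f1np1 con-name bond0-p1 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 10.255.153.242/24
nmcli con up bond0
```

Presumably the LNet configuration would then also change from the current two o2ib NIs to a single NI on the bond (something like networks="o2ib(bond0)"), trading LNet multi-rail for link-level failover.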
<div><br></div><hr style="width: 210px; height: 1px;" color="#b5c4df" size="1" align="left">
<div><span><div style="MARGIN: 10px; FONT-FAMILY: verdana; FONT-SIZE: 10pt"><div>chenzufei@gmail.com</div></div></span></div>
<blockquote style="margin-Top: 0px; margin-Bottom: 0px; margin-Left: 0.5em; margin-Right: inherit"><div> </div><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm"><div style="PADDING-RIGHT: 8px; PADDING-LEFT: 8px; FONT-SIZE: 12px;FONT-FAMILY:tahoma;COLOR:#000000; BACKGROUND: #efefef; PADDING-BOTTOM: 8px; PADDING-TOP: 8px"><div><b>From:</b> <a href="mailto:adilger@ddn.com">Andreas Dilger</a></div><div><b>Date:</b> 2025-05-31 14:28</div><div><b>To:</b> <a href="mailto:chenzufei@gmail.com">zufei chen</a></div><div><b>CC:</b> <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss</a></div><div><b>Subject:</b> Re: [lustre-discuss] Client Eviction and EIO Errors During Simulated Network Flapping (Lustre 2.15.5 + RoCE)</div></div></div><div><div class="FoxDiv20250602001603064702">
Continuous network failures create a very challenging environment for a network filesystem. Even though there are server-side resends of lock callbacks, eventually the client will miss two or three callbacks and the server has no choice but to evict it from the
filesystem if it wants to make progress with other client requests.
<div><br>
</div>
<div>This can also cause problems for other clients, since they are waiting to get a lock that the broken client is holding, which makes the whole filesystem "hang" until the client finally gets the callback, or is evicted. <br>
<div><br>
</div>
<div>We've discussed a few potential solutions for this, but nothing has been implemented yet:</div>
<div>- put clients with continual network errors into the "dog house", so they cannot use the filesystem until their network is repaired; this is drastic for that client (though it improves life for other clients)</div>
<div>- change clients with continual network errors from writeback cache to cacheless/lockless/sync, which will hurt their performance but still allow the client to access the filesystem, without impacting other clients. <br>
<div><br id="lineBreakAtBeginningOfSignature">
<div dir="ltr">Cheers, Andreas</div>
<div dir="ltr"><br>
<blockquote type="cite" style="margin-top: 0px;">On May 29, 2025, at 00:19, zufei chen via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite" style="margin-top: 0px;">
<div dir="ltr">
<div dir="ltr"><br>
<div>I. Background:<br>
1 Four physical nodes; each physical machine hosts 2 virtual machines: lustre-mds-nodexx (containing 2 MDTs) and lustre-oss-nodexx (containing 8 OSTs; one of the OSS nodes also hosts the MGS).<br>
2 Two RoCE network interfaces, ens6f0np0 and ens6f1np1, on the physical machines are virtualized and passed through to the virtual machines (service1 and service2).<br>
3 Using Lustre version 2.15.5 with Pacemaker.<br>
4 A client is running vdbench workloads.<br>
5 Simulating network interface flapping on ens6f0np0 on one of the physical nodes using the following script:<br>
for i in {1..10}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; sleep 30; done<br>
<br>
II. Problem:<br>
1 After running the network flapping script for a while, the workload starts getting EIO errors, leading to service interruption.<br>
2 The issue is reproducible almost every time.<br>
<br>
III. Preliminary Analysis:<br>
The issue is suspected to be caused by lock callback timeouts, which lead to the server evicting the client.<br>
<br>
IV. Relevant Logs:<br>
Server:<br>
May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 268s: evicting client at 10.255.153.118@o2ib ns: filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb
<br>
lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid: 14457 timeout: 60814 lvb_type: 1<br>
<br>
Client:<br>
May 27 12:09:27 rocky9vm2 kernel: Lustre: PFStest-OST0005-osc-ff49d5028d989800: Connection to PFStest-OST0005 (at 10.255.153.242@o2ib) was lost; in-progress operations using this service will wait for recovery to complete.<br>
<br>
V. Additional Information<br>
IP Configuration in Virtual Machines:<br>
| Virtual Machine | Service | IP Address |<br>
| ----------------- | -------- | -------------- |<br>
| lustre-mds-node32 | service1 | 10.255.153.236 |<br>
| | service2 | 10.255.153.237 |<br>
| lustre-oss-node32 | service1 | 10.255.153.238 |<br>
| | service2 | 10.255.153.239 |<br>
| lustre-mds-node40 | service1 | 10.255.153.240 |<br>
| | service2 | 10.255.153.241 |<br>
| lustre-oss-node40 | service1 | 10.255.153.242 |<br>
| | service2 | 10.255.153.243 |<br>
| lustre-mds-node41 | service1 | 10.255.153.244 |<br>
| | service2 | 10.255.153.245 |<br>
| lustre-oss-node41 | service1 | 10.255.153.246 |<br>
| | service2 | 10.255.153.247 |<br>
| lustre-mds-node42 | service1 | 10.255.153.248 |<br>
| | service2 | 10.255.153.249 |<br>
| lustre-oss-node42 | service1 | 10.255.153.250 |<br>
| | service2 | 10.255.153.251 |<br>
<br>
2 Policy Routing Configuration on Server (Example: lustre-oss-node40):<br>
<br>
cat /etc/iproute2/rt_tables<br>
#<br>
# reserved values<br>
#<br>
255 local<br>
254 main<br>
253 default<br>
0 unspec<br>
#<br>
# local<br>
#<br>
#1 inr.ruhep<br>
263 service1<br>
271 service2<br>
<br>
[root@lustre-oss-node40 ~]# ip route show table service1<br>
<a href="http://10.255.153.0/24">10.255.153.0/24</a> dev service1 scope link src 10.255.153.242<br>
[root@lustre-oss-node40 ~]# ip route show table service2<br>
<a href="http://10.255.153.0/24">10.255.153.0/24</a> dev service2 scope link src 10.255.153.243<br>
[root@lustre-oss-node40 ~]# ip rule list<br>
0: from all lookup local<br>
32764: from 10.255.153.243 lookup service2<br>
32765: from 10.255.153.242 lookup service1<br>
32766: from all lookup main<br>
32767: from all lookup default<br>
[root@lustre-oss-node40 ~]# ip route<br>
<a href="http://10.255.153.0/24">10.255.153.0/24</a> dev service2 proto kernel scope link src 10.255.153.243 metric 101<br>
<a href="http://10.255.153.0/24">10.255.153.0/24</a> dev service1 proto kernel scope link src 10.255.153.242 metric 102<br>
<br>
3 /etc/modprobe.d/lustre.conf:<br>
options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"<br>
options libcfs cpu_npartitions=2<br>
options mdt max_mod_rpcs_per_client=128<br>
options mdt mds_io_num_cpts=[0,1]<br>
options mdt mds_num_cpts=[0,1]<br>
options mdt mds_rdpg_num_cpts=[0,1]<br>
options mds mds_num_threads=512<br>
options ost oss_num_threads=512<br>
options ost oss_cpts=[0,1]<br>
options ost oss_io_cpts=[0,1]<br>
options lnet portal_rotor=1<br>
options lnet lnet_recovery_limit=10<br>
options ptlrpc ldlm_enqueue_min=260<br>
<br>
VI. Other Attempts<br>
1 Reduced LNet Timeout and Increased Retry Count:<br>
Both server and client have reduced the LNet timeout and increased the retry count, but the issue persists.<br>
lnetctl set transaction_timeout 10<br>
lnetctl set retry_count 3<br>
lnetctl set health_sensitivity 1<br>
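To double-check that these values are actually in effect, and to watch retransmission behaviour during a flap, the standard lnetctl query commands can be used (output fields vary somewhat by version):

```
lnetctl global show     # shows retry_count, transaction_timeout, health_sensitivity
lnetctl net show -v 3   # per-NI health value and statistics
lnetctl stats show      # global drop and retransmit counters
```

One caveat, if I understand LNet health correctly: the transaction_timeout budget is shared across the retries, so each send attempt gets roughly transaction_timeout / (retry_count + 1), about 2.5s with the values above, meaning all retries of a given message still fall inside a 20s down-window and recovery is left to ptlrpc-level resends.<br>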
<br>
2 Set Recovery Limit:<br>
Both server and client have set the recovery limit, but the issue persists.<br>
lnetctl set recovery_limit 10<br>
<br>
3 Simulated Network Flapping Using iptables:<br>
Simulated network flapping using iptables in the virtual machines, but the issue persists.<br>
#!/bin/bash<br>
for j in {1..1000}; do<br>
date;<br>
echo -e "\nIteration $j: Starting single-port network flapping\n";<br>
for i in {1..10}; do<br>
echo -e " ==== Iteration $i down ===="; date;<br>
sudo iptables -I INPUT 1 -i service1 -j DROP;<br>
sudo iptables -I OUTPUT 1 -o service1 -j DROP;<br>
sleep 20;<br>
echo -e " ==== Iteration $i up ===="; date;<br>
sudo iptables -D INPUT -i service1 -j DROP;<br>
sudo iptables -D OUTPUT -o service1 -j DROP;<br>
sleep 30s;<br>
done<br>
echo -e "\nIteration $j: Ending single-port network flapping\n"; date;<br>
sudo iptables -L INPUT -v | grep -i service1<br>
sudo iptables -L OUTPUT -v | grep -i service1<br>
sleep 120;<br>
done<br>
<br>
VII. Any Suggestions?<br>
Dear all, I would appreciate any suggestions or insights you might have regarding this issue. Thank you!</div>
</div>
<span>_______________________________________________</span><br>
<span>lustre-discuss mailing list</span><br>
<span>lustre-discuss@lists.lustre.org</span><br>
<span>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</span><br>
</div>
</blockquote>
</div>
</div>
</div>
</div></div></blockquote>
</body></html>