<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p class="isSelectedEnd"><span>Hello everyone,</span></p>
<p class="isSelectedEnd"><span>I’m troubleshooting an issue on Rocky
Linux 8.10 where an MGS mount appears to succeed (exit code 0)
but the server unmounts a few seconds later due to MGC request
timeouts. Because of this, MDT/OST targets cannot register with
the MGS afterwards and I can't get a working filesystem.</span></p>
<p><span>At first I thought this was related to switching from
ldiskfs to ZFS (OpenZFS DKMS), because the problem started after
installing ZFS DKMS (from the Lustre repo) and rebuilding
modules. However, I reproduced the same behavior even when using
the kmod-based Lustre packages and also when trying ldiskfs
again, so I'm a bit lost on what could have caused the issue.</span></p>
<p>I'm running Lustre 2.15.7 on Rocky 8.10, and this behavior
happens on the (only) MGS/MDS node. To reproduce the issue, I only
need to format the MGT and then mount it normally. The mount
command returns success, but a few seconds later the server
unmounts by itself.</p>
<p>Example output in dmesg: </p>
<p><font face="monospace">Lustre:
236012:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1768852634/real
1768852634] req@000000001ae72b11 x1854776314691712/t0(0)
o251->MGC10.0.0.4@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl
1768852640 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'</font></p>
<p><font face="monospace">Lustre: server umount MGS complete</font></p>
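<p>To make the timing in that log line easier to read, here is a
minimal Python sketch that pulls out the opcode, the target, and
the gap between the send time and the deadline. It assumes that
<font face="monospace">sent</font> and <font face="monospace">dl</font>
are epoch seconds for the send time and the request deadline; that
is my reading of the fields, not something I have confirmed. The
function name <font face="monospace">parse_expired_request</font>
is made up for this example.</p>
<pre>
```python
import re

# Log line copied from the dmesg output above, joined onto one line.
LINE = (
    "Lustre: 236012:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request "
    "sent has timed out for slow reply: [sent 1768852634/real 1768852634] "
    "req@000000001ae72b11 x1854776314691712/t0(0) "
    "o251->MGC10.0.0.4@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1768852640 "
    "ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'"
)


def parse_expired_request(line):
    """Extract opcode, target, send time and deadline from one log line.

    Returns None if the line does not look like an expired-request message.
    """
    sent = re.search(r"\[sent (\d+)/real (\d+)\]", line)
    target = re.search(r"o(\d+)->(\S+?):\d+/\d+", line)
    deadline = re.search(r"\bdl (\d+)\b", line)
    if not (sent and target and deadline):
        return None
    sent_at = int(sent.group(1))
    dl = int(deadline.group(1))
    return {
        "opcode": int(target.group(1)),
        "target": target.group(2),
        "sent": sent_at,
        "deadline": dl,
        # Assumption: both values are epoch seconds, so this is in seconds.
        "window_s": dl - sent_at,
    }


if __name__ == "__main__":
    print(parse_expired_request(LINE))
```
</pre>
<p>For the line above, this reports opcode 251, target
<font face="monospace">MGC10.0.0.4@tcp@0@lo</font> and a 6-second
window, which matches the unmount happening a few seconds after
the mount.</p>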
<p>Things I've checked or tried so far:</p>
<ul>
<li><font face="monospace">lnetctl ping &lt;mgs_ip&gt;@tcp</font>
works</li>
<li><font face="monospace">lctl list_nids</font> shows the correct
NID</li>
<li><font face="monospace">lnetctl net show</font> shows the TCP
NI on the correct interface, as well as the loopback NI</li>
<li><font face="monospace">lnetctl ping &lt;mgs_ip&gt;@tcp</font>
from another host works</li>
<li>Port 988 is listening and open</li>
<li><span>Disabling firewalld does not change anything</span></li>
<li><span>SELinux is disabled</span></li>
<li>Removed all Lustre, ZFS, kmod and DKMS packages, rebuilt the
initramfs, switched to the stock kernel and back to the custom
Lustre kernel, reinstalled all packages, etc.</li>
</ul>
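<p>In case it helps anyone repeat the network checks from the list
above, they can be scripted roughly as follows. This is only a
sketch: the IP is taken from the MGC name in the log line, the
helper names are invented for this example, and the port test is a
plain TCP connect, which says nothing about whether LNet accepts
the connection afterwards.</p>
<pre>
```python
import socket
import subprocess

MGS_IP = "10.0.0.4"   # assumption: taken from "MGC10.0.0.4@tcp" in the log
LNET_PORT = 988       # port mentioned in the list above


def build_checks(mgs_ip):
    """Return the Lustre CLI checks from the list above as argv lists."""
    nid = mgs_ip + "@tcp"
    return [
        ["lctl", "list_nids"],
        ["lnetctl", "net", "show"],
        ["lnetctl", "ping", nid],
    ]


def port_open(host, port, timeout=3.0):
    """True if a plain TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(mgs_ip):
    """Run each command and record its exit code (None if the tool is missing)."""
    results = {}
    for argv in build_checks(mgs_ip):
        try:
            # stdout/stderr pipes instead of capture_output, so this also
            # runs on the Python 3.6 that ships with EL8.
            proc = subprocess.run(argv, stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE,
                                  universal_newlines=True)
            results[" ".join(argv)] = proc.returncode
        except FileNotFoundError:
            results[" ".join(argv)] = None
    results["tcp port %d" % LNET_PORT] = port_open(mgs_ip, LNET_PORT)
    return results


if __name__ == "__main__":
    for name, outcome in run_checks(MGS_IP).items():
        print(name, "->", outcome)
```
</pre>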
<p>However, none of this helped, and I can't explain the issue or
why I can no longer mount even with regular ldiskfs.</p>
<p>Does anyone know what causes this issue and what I could do to
fix it? My last resort would be reinstalling the OS and starting
from scratch, but I would very much prefer not to do that. This is
a testing environment, so I don't mind having to reformat,
recreate or reinstall anything.</p>
<p>Thank you very much in advance.</p>
<p>Santiago</p>
</body>
</html>