[lustre-discuss] MGS mount succeeds then unmounts due to MGC timeout
Santiago Freire - InCo
sfreire at fing.edu.uy
Mon Jan 19 12:13:07 PST 2026
Hello everyone,
I’m troubleshooting an issue on Rocky Linux 8.10 where an MGS mount
appears to succeed (exit code 0) but the server unmounts a few seconds
later due to MGC request timeouts. Because of this, MDT/OST targets
cannot register with the MGS afterwards and I can't get a working
filesystem.
At first I thought this was related to switching from ldiskfs to ZFS
(OpenZFS DKMS), because the problem started after installing ZFS DKMS
(from the Lustre repo) and rebuilding modules. However, I reproduced the
same behavior even when using the kmod-based Lustre packages and also
when trying ldiskfs again, so I'm a bit lost on what could have caused
the issue.
I'm running Lustre 2.15.7 on Rocky 8.10, and this behaviour happens on
the (only) MGS/MDS node. To reproduce the issue, I only need to format
the MGT and then mount it normally. The mount command returns success,
but shortly after that the server unmounts automatically.
Example output in dmesg:
Lustre: 236012:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1768852634/real 1768852634]
req@000000001ae72b11 x1854776314691712/t0(0)
o251->MGC10.0.0.4@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1768852640 ref
2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'
Lustre: server umount MGS complete
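For anyone who wants to read the fields out of that message, here is a minimal Python sketch. It assumes the wrapped log lines are rejoined into one string, that `dl` is the request deadline in epoch seconds, and it leaves out the truncated `job:` field. The function name `parse_expired_request` is just an illustrative helper, not part of any Lustre tool.

```python
import re

# The dmesg line from above, rejoined into a single string.
# The trailing "job:" field is omitted because it is cut off in the log.
LINE = (
    "Lustre: 236012:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request "
    "sent has timed out for slow reply: [sent 1768852634/real 1768852634] "
    "req@000000001ae72b11 x1854776314691712/t0(0) "
    "o251->MGC10.0.0.4@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1768852640 ref "
    "2 fl Rpc:XNQr/0/ffffffff rc 0/-1"
)


def parse_expired_request(line):
    """Extract opcode, target, send time and deadline from one request dump."""
    sent = re.search(r"\[sent (\d+)/real (\d+)\]", line)
    target = re.search(r"\bo(\d+)->(\S+?):\d+/\d+", line)
    deadline = re.search(r"\bdl (\d+)\b", line)
    if not (sent and target and deadline):
        raise ValueError("line does not look like a ptlrpc request dump")
    sent_s = int(sent.group(1))
    deadline_s = int(deadline.group(1))
    return {
        "opcode": int(target.group(1)),   # numeric RPC opcode after the "o"
        "target": target.group(2),        # import name and NIDs
        "sent": sent_s,
        "deadline": deadline_s,
        "window_s": deadline_s - sent_s,  # seconds between send and deadline
    }


info = parse_expired_request(LINE)
print(info["opcode"], info["target"], info["window_s"])
```

For the line above this gives opcode 251, target MGC10.0.0.4@tcp@0@lo, and a window of 6 seconds between the send time and the deadline.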
A few things I've checked or tried:
* lnetctl ping <mgs_ip>@tcp works on the MGS node itself
* lctl list_nids shows the correct NID
* lnetctl net show shows the TCP NI on the correct interface and the
loopback
* lnetctl ping <mgs_ip>@tcp from another host works
* Port 988 is listening and open
* Disabling firewalld does not change anything
* SELinux is disabled
* Removed all Lustre, zfs, kmod and dkms packages, rebuilt initramfs,
changed to stock kernel and back to custom Lustre kernel,
reinstalled all packages, etc.
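The port 988 check in the list above can be repeated from any host with a short script. This is only a sketch: `tcp_port_open` is a hypothetical helper name, and a successful connect only shows that the TCP handshake completes, not that LNet accepts the connection.

```python
import socket


def tcp_port_open(host, port=988, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `tcp_port_open("<mgs_ip>")` run from another node should return True if the port is reachable through the firewall.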
However, nothing worked, and I can't explain the issue or why I can't
even mount with plain ldiskfs anymore.
Does anyone know the cause of this issue and what I could do to fix
it? My last resort would be reinstalling the OS and starting from
scratch but I would very much prefer not to do that. This is a testing
environment so I don't mind having to reformat, recreate or reinstall
anything.
Thank you very much in advance.
Santiago