[lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker

Laura Hild lsh at jlab.org
Mon Mar 17 07:20:35 PDT 2025


You'll notice that the fix lives in the system configuration rather than in Pacemaker itself, and that what it effectively does is choose which of the possible behaviors to go with.  Just as an example (I'm definitely *not* recommending this), you could probably also have modified pacemaker.service to put the cluster in maintenance mode before it is stopped.

More importantly, /usr/lib/systemd/system/resource-agents-deps.target is owned by the resource-agents package, so a package update can overwrite your change.  If you're going to go with the solution you described, you should *not* modify the file in /usr/lib, but rather make the change as an actual drop-in as described in what Oyvind linked, i.e. in a .conf file whose first line is "[Unit]" in the /etc/systemd/system/resource-agents-deps.target.d/ directory.
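
As a rough sketch of what such a drop-in could look like, the function below writes the two ordering directives from the quoted message into a .conf file in the drop-in directory.  The function name and the file name "lustre-ordering.conf" are hypothetical (systemd reads any *.conf file in that directory), and the directory argument exists only so the function can be tried against a scratch directory first.

```shell
# Sketch only: put the ordering directives in a drop-in, leaving the
# packaged unit file in /usr/lib untouched.
# "write_dropin" and "lustre-ordering.conf" are illustrative names.
write_dropin() {
    dropin_dir="${1:-/etc/systemd/system/resource-agents-deps.target.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/lustre-ordering.conf" <<'EOF'
[Unit]
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
EOF
}

# As root on each cluster node:
#   write_dropin && systemctl daemon-reload
#   systemctl cat resource-agents-deps.target   # should list the drop-in
```

Any earlier edit to the file under /usr/lib would still need to be reverted separately, for example by reinstalling the resource-agents package.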



________________________________________
From: chenzufei at gmail.com <chenzufei at gmail.com>
Sent: Friday, 14 March 2025 23:06
To: Laura Hild
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker

Thank you for your advice.

A user named Oyvind replied on the users at clusterlabs.org mailing list:
You need the systemd drop-in functionality introduced in RHEL 9.3
to avoid this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2184779

My understanding of the cause is as follows:
During reboot, the system and Pacemaker both try to unmount the Lustre resource at the same time.
If the system starts unmounting first and Pacemaker tries afterward, Pacemaker's stop immediately returns success.
However, at that point the system's unmount has not yet completed,
so Pacemaker mounts the target on the other node, which triggers this issue.

My current modification is as follows:
Add the following lines to the file `/usr/lib/systemd/system/resource-agents-deps.target`:
```
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
```

After making this modification, the issue no longer occurs during reboot.
________________________________
chenzufei at gmail.com



From: Laura Hild <lsh at jlab.org>
Date: 2025-03-06 06:12
To: chenzufei at gmail.com
CC: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I virtually never try to (cleanly) reboot a host from which I have not already evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker in maintenance mode and unmounting/exporting everything manually.  If I can't evacuate all resources and complete a lustre_rmmod, the host is getting power-cycled.
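
The standby variant of that routine could be sketched roughly as below.  This assumes a pcs version in which `pcs node standby` with no node name acts on the local node and accepts `--wait`, and that lustre_rmmod exits non-zero while Lustre modules are still in use; the function name and the 300-second timeout are illustrative choices, not part of any tool.

```shell
# Sketch only: drain this node, unload Lustre, then reboot.
# "evacuate_and_reboot" and the 300 s timeout are illustrative.
evacuate_and_reboot() {
    # Ask Pacemaker to move every resource off this node and wait for it.
    pcs node standby --wait=300 || {
        echo "standby did not complete; not rebooting cleanly" >&2
        return 1
    }
    # Unloading the Lustre modules only works once nothing is mounted.
    lustre_rmmod || {
        echo "lustre_rmmod failed; consider power-cycling instead" >&2
        return 1
    }
    systemctl reboot
}
```

Standby is stored in the cluster configuration and persists across the reboot, so the node needs a `pcs node unstandby` afterward before it will host resources again.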

So my guess would be that, in the host's shutdown process, the Pacemaker service is stopped before the filesystems are unmounted, and that Pacemaker doesn't want to assume whether its own shutdown means it should go to standby or enter maintenance mode.  The other host therefore ends up knowing only that its partner has disappeared, while the filesystems have yet to be unmounted.


