[lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker

Thu Mar 6 08:18:34 PST 2025

To add to this: instead of issuing a straight reboot, I prefer running
'pcs stonith fence <node>', which fails over resources appropriately
AND reboots the node (if doable) or otherwise powers it off. The
advantage of doing it this way is that Pacemaker stays in the know about
the state of the node, so it doesn't (usually) shoot the node while it's
trying to boot back up. When you do maintenance on a node without
letting Pacemaker know about it, results can be unpredictable.

Cameron

On 3/5/25 2:12 PM, Laura Hild via lustre-discuss wrote:
> I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I virtually never try to (cleanly) reboot a host from which I have not already evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker in maintenance mode and unmounting/exporting everything manually.  If I can't evacuate all resources and complete a lustre_rmmod, the host is getting power-cycled.
>
> So maybe I can say, my guess would be that in the host's shutdown process, stopping the Pacemaker service happens before filesystems are unmounted, and that Pacemaker doesn't want to make an assumption whether its own shut-down means it should standby or initiate maintenance mode, and therefore the other host ends up knowing only that its partner has disappeared, while the filesystems have yet to be unmounted.
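
The evacuate-first workflow described in the quoted message could be sketched roughly as below. This assumes the script runs on the node being rebooted; the node name mds01, the DRY_RUN switch, and the helper names run and evacuate_and_reboot are hypothetical. In real use you would also confirm with 'pcs status' that every resource has actually moved before unloading the Lustre modules.

```shell
#!/bin/sh
# Sketch only: evacuate resources, unload Lustre, then reboot cleanly.
# Assumes it is run on the node that is about to be rebooted.
# NODE (mds01) is a hypothetical host name. DRY_RUN=1, the default here,
# prints the commands instead of executing them.
NODE="${NODE:-mds01}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

evacuate_and_reboot() {
    # Put the node in standby so Pacemaker moves its resources elsewhere.
    run pcs node standby "$1" || return 1
    # Unload the Lustre modules; if this fails, something is still in use.
    if ! run lustre_rmmod; then
        echo "lustre_rmmod failed; power-cycle the host instead" >&2
        return 1
    fi
    # Only now is a clean reboot attempted.
    run systemctl reboot
}

evacuate_and_reboot "$NODE"
```

Once the node is back up, 'pcs node unstandby <node>' lets it host resources again.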