[lustre-discuss] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
Cameron Harr
harr1 at llnl.gov
Thu Mar 6 08:18:34 PST 2025
To add to this: instead of issuing a straight reboot, I prefer running
'pcs stonith fence <node>', which fails over resources appropriately
AND reboots the node (if possible) or otherwise powers it off. The
advantage of doing it this way is that Pacemaker stays informed about
the state of the node, so it doesn't (usually) shoot the node as it's
trying to boot back up. When you do maintenance on a node without
letting Pacemaker know about it, results can be unpredictable.
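
A rough sketch of the fencing approach, for illustration only. The
fence_node wrapper, the DRY_RUN variable, and the node name in the usage
note are hypothetical; only 'pcs stonith fence' itself comes from the
text above. It assumes you run it from a surviving cluster node, not from
the node being rebooted.

```shell
#!/bin/sh
# Sketch: reboot a cluster node by asking Pacemaker to fence it.
# fence_node and DRY_RUN are illustrative names, not part of pcs.

fence_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: fence_node <node>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "pcs stonith fence $node"
    else
        # Whether the node is rebooted or powered off depends on the
        # configured fence device and its settings.
        pcs stonith fence "$node"
    fi
}
```

For example, 'DRY_RUN=1 fence_node mds1' only prints the command, which
is a cheap way to check the node name before actually shooting anything.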
Cameron
On 3/5/25 2:12 PM, Laura Hild via lustre-discuss wrote:
> I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I virtually never try to (cleanly) reboot a host from which I have not already evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker in maintenance mode and unmounting/exporting everything manually. If I can't evacuate all resources and complete a lustre_rmmod, the host is getting power-cycled.
>
> So maybe I can say, my guess would be that in the host's shutdown process, stopping the Pacemaker service happens before filesystems are unmounted, and that Pacemaker doesn't want to make an assumption whether its own shut-down means it should standby or initiate maintenance mode, and therefore the other host ends up knowing only that its partner has disappeared, while the filesystems have yet to be unmounted.
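
The evacuate-first approach quoted above could be sketched roughly as
follows. This is an assumption-laden outline, not a tested procedure: the
run and evacuate_and_reboot helpers and the DRY_RUN variable are
hypothetical names, the script is assumed to run on the node that is
about to be rebooted, and using 'systemctl reboot' for the final step is
my choice rather than something stated in the thread.

```shell
#!/bin/sh
# Sketch: evacuate resources from the local node, unload Lustre, reboot.
# run, evacuate_and_reboot, and DRY_RUN are illustrative names.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "$*"
    else
        "$@"
    fi
}

evacuate_and_reboot() {
    # With no node argument, pcs puts the current node into standby.
    # --wait asks pcs to wait for the node to enter standby; it is still
    # worth confirming with 'pcs status' that resources have moved.
    run pcs node standby --wait || return 1

    # If the Lustre modules cannot be unloaded, something is still in
    # use; stop here rather than attempting a clean reboot.
    run lustre_rmmod || {
        echo "lustre_rmmod failed; consider power-cycling instead" >&2
        return 1
    }

    run systemctl reboot
}
```

Once the node is back up, 'pcs node unstandby' lets it host resources
again.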