[Lustre-discuss] Understanding of MMP

Andreas Dilger adilger at sun.com
Mon Oct 19 11:14:56 PDT 2009


On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote:
> Perhaps I have a problem understanding multiple mount protection
> (MMP). I have a cluster. When a failover happens, I sometimes get this
> log entry:
>
> Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> ldiskfs_multi_mount_protect: Device is already active on another node.
> Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168,
> last update node: sososd3, last update device: dm-2
>
> Does the second line mean that my node (sososd7) tried to mount
> /dev/dm-2, but MMP prevented it from doing so because the last update
> from the old node (sososd3) was too recent?

The update time stored in the MMP block is purely informational.  MMP
actually uses a sequence counter that has nothing to do with the system
clock on either node (since the clocks may not be in sync).
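
For illustration only, the on-disk MMP block can be pictured as
something like the C struct below.  The field names are simplified
stand-ins, not the exact ldiskfs/ext4 on-disk layout; the important
part is the sequence counter that the node holding the mount keeps
bumping:

#include <stdint.h>

/*
 * Simplified sketch of an MMP block; field names are illustrative,
 * not the exact ldiskfs/ext4 definitions.
 */
struct mmp_block {
        uint32_t mmp_magic;         /* marks the block as MMP data        */
        uint32_t mmp_seq;           /* sequence counter, bumped regularly
                                     * by the node that has the fs mounted */
        uint64_t mmp_time;          /* last update time (informational)   */
        char     mmp_nodename[64];  /* e.g. "sososd3" (informational)     */
        char     mmp_bdevname[32];  /* e.g. "dm-2" (informational)        */
        uint16_t mmp_interval;      /* update interval, in seconds        */
};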

What that message actually means is that sososd7 tried to mount the
filesystem on dm-2 (which likely has another "LVM" name that the kernel
doesn't know anything about) but the MMP block on the disk was modified
by sososd3 AFTER sososd7 first looked at it.

That is a very bad thing, and is exactly what MMP is designed to detect.
Is sososd3 still running at this point, or has it been STONITH'd?

> From the manuals I found the MMP time of 109 seconds?  Is it correct
> that after the umount the next node cannot mount the same filesystem
> within 10 seconds?

You wrote "109" seconds...  Did you really mean "10" seconds?  In any
case, the default MMP timeout is 5s (unless high system load forces this
to be larger), and the mounting node waits 2x this interval to ensure
that the other node has at least one full interval in which to write a
new MMP block.
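
As a sketch of that handshake (illustrative user-space C, not the
actual ldiskfs code; read_seq() is a hypothetical helper standing in
for reading the MMP block off the disk):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

extern uint32_t read_seq(void);  /* hypothetical: reads mmp_seq from disk */

int mmp_check_before_mount(unsigned int interval)
{
        uint32_t seq = read_seq();

        /* Wait 2x the update interval so a live node gets at least one
         * full interval in which to bump the sequence counter. */
        sleep(2 * interval);

        if (read_seq() != seq)
                return -EBUSY;   /* counter moved: "Device is already
                                  * active on another node."          */

        return 0;                /* counter unchanged: safe to claim
                                  * the device and start updating it  */
}

If the counter moves between the two reads, another node is alive and
writing, which is exactly the situation in the log above.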

> So the solution would be to wait for 10 seconds before mounting the
> resource on the next node. Is this correct?

If the other node is still using the filesystem, then waiting 10s will
not help.  The HA software needs to power off (STONITH) the previous
node before it starts a failover.  Otherwise, there may be any number of
blocks still in cache or in the IO elevator that might land on the disk
after the takeover, if the "failing" node is very slow but not quite
dead.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
