[Lustre-discuss] Heartbeat, LVM and Lustre

Brian J. Murrell Brian.Murrell at Sun.COM
Thu Dec 10 12:29:30 PST 2009


On Fri, 2009-12-11 at 02:01 +1100, Atul Vidwansa wrote: 
> 
> When I reboot MDS nodes and start MDTs with "service 
> heartbeat start" simultaneously on both mds nodes, sometimes I get 
> following message:

With both nodes up and running at the same time, likely they have both
done a vgscan; vgchange -a y on the shared disk(s).  I don't know that
this is in itself a problem.  I do the same thing here and I have not
(yet) seen any ill effects.  I am far from an LVM expert however.

> mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and 
> should not be!
> mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data 
> integrity!

I wonder how it's determining that LVM::mgsvg is "active" and what it
considers "active".  A look into the source for that would most likely
be very fruitful.  And it was.

It seems that "/usr/lib/ocf/resource.d/heartbeat/LVM status" is what is
used to determine who owns the resource.  The LVM resource script does
that with a:

vgdisplay [-v if lvm version is >= 2 ] $volume 2>&1 | grep -i 'Status[ \t]*available'

What is interesting is on my LVM 2 system, vgdisplay with -v also shows
a:

  LV Status              available

for every volume in the VG.  I wonder if they are just not accounting
for that.  Or maybe that's what they are looking for given that on my
active and in use LVM system here, for the VG itself, Status shows:

  VG Status             resizable

So they can't be looking for an "available" in the VG Status for
"resource ownership" and must want the LV Status line(s).
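
A minimal sketch of that check, run against sample text rather than a
live system.  The sample is modeled on the two kinds of Status lines
quoted above; the number of LVs (two) and the helper name is_active are
assumptions for illustration, not taken from the resource script:

```shell
# Sample text standing in for "vgdisplay -v" output (assumed, not captured
# from a real system): one VG Status line and two LV Status lines.
sample_output='  VG Status             resizable
  LV Status              available
  LV Status              available'

# Hypothetical helper: reads vgdisplay-style text on stdin and succeeds
# if any line matches the pattern quoted from the LVM resource script.
is_active() {
    grep -qi 'Status[ \t]*available'
}

# The LV Status lines match, so the VG is reported as "active".
if printf '%s\n' "$sample_output" | is_active; then
    echo "pattern matched: resource would be considered active"
fi

# The VG Status line alone does not match.
if ! printf '  VG Status             resizable\n' | is_active; then
    echo "VG Status line alone: no match"
fi
```

So with this pattern, any activated LV in the VG is enough for the
status check to succeed, regardless of what the VG Status line says.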

Looking a little further, the LVM script has both "start" and "stop"
actions which presumably heartbeat invokes to (dis-)"own" a resource.
These two actions do:

vgscan; vgchange -a y $1

and

vgchange -a n $1

respectively.  That implies that heartbeat wants to own an entire VG or
nothing.  It would appear you cannot have multiple volumes from a single
VG owned by different nodes.  As I said, I do this myself and have found
no issues, but am not at all a heavy, or what I would call "production"
user.
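
To make the VG-level granularity concrete, here is a rough sketch of the
two actions as a small wrapper.  This is not the actual heartbeat LVM
script; the names lvm_resource, run and DRY_RUN are hypothetical, and by
default it only prints the commands instead of running them:

```shell
# Hypothetical wrapper, NOT the real heartbeat LVM resource script.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

lvm_resource() {
    action=$1
    vg=$2
    case "$action" in
        start)
            # Take ownership: rescan, then activate every LV in the VG.
            run vgscan
            run vgchange -a y "$vg"
            ;;
        stop)
            # Give up ownership: deactivate every LV in the VG.
            run vgchange -a n "$vg"
            ;;
        *)
            echo "usage: lvm_resource {start|stop} VG" >&2
            return 1
            ;;
    esac
}

lvm_resource start mgsvg
lvm_resource stop mgsvg
```

Note that both actions take only a VG name, so there is nowhere to name
an individual LV.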

> and heartbeat on both mds nodes does not start any resource (even after 
> waiting for 35 minutes).

Well, it would seem that heartbeat has found a condition it considers
dangerous and is stopping there so as not to cause any damage.  From the
looks of things, you will need to disable the operating system's LVM
startup code and leave it to heartbeat to manage, if you buy into their
assumptions.  Might be worth a question or two on the LVM list to see if
the assumptions are valid or not -- or resign yourself to allowing
heartbeat to operate LVM resource ownership at the VG level and not the
LV level.

Cheers,
b.
