[lustre-discuss] Experimenting with "live" ldiskfs mount of MDT

Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation] nathan.dauchy at nasa.gov
Fri Jun 17 10:37:07 PDT 2016


Greetings All,

IMPORTANT: Do not try this on a live system with data that you care about.  The following is not documented or supported in any way, and only barely tested.  You have been warned!

Now that I have your attention, what I am referring to is a method to use "hidden" mounts on a server to get a "live" POSIX interface to a Lustre target, without using LVM or hardware-based snapshots.  I have only experimented with an MDT using ldiskfs; ZFS may have ways of making this safer and easier, or it might not work at all.  While a Lustre MDT is running, we can use bind and remount tricks to see it as ldiskfs *and* get it into a read-only mode.

I'm posting this just in case it fills an emergency need that someone has.  Or, if you are feeling adventurous and want to experiment and report back, that would be great.  Maybe this actually works well, allows for cool new use cases, and can become an intentional and documented feature in the future!

As I mentioned, this is not a supported feature of Lustre.  I believe it might conceivably be useful enough for backing up the MDT, performing fast filesystem analysis, setting up lightweight event triggers, debugging, etc. that it *could* be worth trying to enhance and make supportable in the future.  If anyone wants to play with it, or has a student in need of a research project, well... here you go.  An example showing the procedure I came up with, along with a few more technical details, is included below.

Thanks and happy hacking,
Nathan


=== Test Procedure ===

* Setup:
        # mkdir /mnt/.hidden_mdt
        # mkdir /mnt/live-mirror

* Then on to the sketchy stuff:
  (The "lfstest--vg-mdttest" device is already a running MDT with active clients.)
        # mount -t ldiskfs /dev/mapper/lfstest--vg-mdttest /mnt/.hidden_mdt
        # mount --bind /mnt/.hidden_mdt /mnt/live-mirror
        # mount -o remount,ro /mnt/live-mirror
        # umount /mnt/.hidden_mdt

* Now we have a read-only mount to access the MDT:
        # mount | grep mdt
    /dev/mapper/lfstest--vg-mdttest on /mnt/lustre/lfstest-mdt type lustre (rw,noauto,acl,errors=panic,user_xattr)
    /mnt/.hidden_mdt on /mnt/live-mirror type none (ro,bind)

        # echo "this should not work" > /mnt/live-mirror/ROOT/test
    -bash: /mnt/live-mirror/ROOT/test: Read-only file system

* Client changes show up immediately:
        # ssh client1 'echo "ASDF" > /mnt/lustre/client/test2'

        # ls -l /mnt/live-mirror/ROOT/test2
    -rw-r--r-- 1 root root 0 May  7 14:51 /mnt/live-mirror/ROOT/test2

  (The file size shows as 0 because the MDT inode holds only metadata; the actual file data lives on the OSTs.)

  Note that instant visibility of all metadata updates is probably not guaranteed, due to cache coherency among the multiple threads involved.

* And we can read all the Extended Attributes, such as with:
        # getfattr -d -m ".*" -e hex /mnt/live-mirror/ROOT/test2
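
  For example, something like this should dump just the striping layout, which (if I recall the xattr names right) lives in "trusted.lov" on the MDT copy of the inode:
        # getfattr -n trusted.lov -e hex /mnt/live-mirror/ROOT/test2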

* Run any system check, backup, or analysis processes as desired...
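
  For example, a rough and untested sketch of a file-level backup (assuming a GNU tar new enough to support --xattrs; the /tmp/mdt_backup.tgz destination is just a placeholder) would be:
        # cd /mnt/live-mirror
        # tar czf /tmp/mdt_backup.tgz --xattrs --xattrs-include="trusted.*" --sparse .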

* Cleanup:
        # sync
        # umount /mnt/live-mirror
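
  As a quick sanity check afterwards (just a suggestion, not something I exercised heavily), confirm that only the real Lustre mount is left and that the server still reports healthy:
        # mount | grep mdt
        # lctl get_param health_check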


=== Analysis with System Tap ===

* In a separate window, watch for journal activity with:
        # stap -e 'probe module("ldiskfs").function("*journal*") { printf("%d %s %s\n", tid(), execname(), probefunc()) }'

  In testing, a simpler direct read-only mount made several journal calls, even with "noload".  That is why the extra bind mounts are required; they appeared to make no journal calls, suggesting that they are safer.
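
  (For the curious, the simpler form that still touched the journal would have been something like:
        # mount -t ldiskfs -o ro,noload /dev/mapper/lfstest--vg-mdttest /mnt/.hidden_mdt
  so avoid that form while the MDT is live.)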

  Note that in the very last "umount" step of the test procedure I did actually see journal events in system tap.  However, none were for the "umount" process, so I believe they were just buffered from the client write, holdovers from sync, or periodic housekeeping.  Repeating the whole process without any client writes resulted in no system tap events at all, as expected.

* In a separate window, look for all function calls from the ldiskfs module:
        # stap -e 'probe module("ldiskfs").function("*") { printf("%d %s %s\n", tid(), execname(), probefunc()) }'

  Testing showed that the total set of calls made through the "sketchy stuff" is:
        10354 mount ldiskfs_get_sb
        10356 mount ldiskfs_release_dir
        10356 mount ldiskfs_release_dir


=== Additional Notes ===

In essence what this is doing is using the same superblock, inodes, etc. for the ldiskfs mount as the running Lustre filesystem.  Internally, Lustre mounts the MDT as ldiskfs, and as long as the second mount uses the same filesystem type, the kernel doesn't really mount the device again; it just creates a second reference to the existing superblock.  The extra steps to make it a read-only mount are simply there to reduce the risk of corrupting the filesystem by accidentally modifying it through the wrong mountpoint.
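
One way to convince yourself of that superblock sharing without involving Lustre at all (a standalone illustration; it assumes a free loop device and that /mnt/a and /mnt/b can be created) is to mount a small ext4 image twice:
        # dd if=/dev/zero of=/tmp/img bs=1M count=64
        # mkfs.ext4 -q /tmp/img
        # LOOPDEV=$(losetup -f --show /tmp/img)
        # mkdir -p /mnt/a /mnt/b
        # mount $LOOPDEV /mnt/a
        # mount $LOOPDEV /mnt/b
        # touch /mnt/a/hello; ls /mnt/b/hello
The second mount does not replay the journal or re-read the superblock; the kernel just attaches another mountpoint to the already-active superblock, which is why /mnt/b immediately sees the file created through /mnt/a.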

Be warned that other approaches that might bypass the kernel checks for multiple mounts of the same block device are more dangerous than this approach, since even a "read-only" mount of the device will still have side-effects like recovering the journal (while the filesystem is already in use by Lustre) that could seriously corrupt the filesystem!

The biggest risks with the "live" ldiskfs and bind mounts are where Lustre accesses the ldiskfs on-disk structures or in-memory cached items *differently* from the VFS.  There is a chance that this may cause in-memory inconsistencies, crashes, or potentially data corruption.  I have been warned of problems during shutdown if the ldiskfs mountpoint is still mounted when Lustre stops, but have not yet tested that.
