[lustre-devel] Direct Modification of Lustre Metadata on Disk

Andreas Dilger adilger at whamcloud.com
Sat Jan 27 21:13:25 PST 2024


On Jan 26, 2024, at 12:57, Saisha Kamat via lustre-devel <lustre-devel at lists.lustre.org> wrote:

I am a Ph.D. student at UNC-Charlotte, focusing on research related to
the Lustre File System. As part of my project, I am investigating
scenarios involving the direct modification of xattr metadata on the
Lustre disk, without unmounting the Lustre servers.

It would be helpful to know what the high-level goal of your research is.
Is this some type of fault injection mechanism, or are you trying to store
useful data directly into the xattr, or something else?  Note that there
have already been a few papers published about this.  If you are looking
for research ideas related to Lustre I could definitely give you a few;
please contact me if interested.  Doubly so if you actually implement
something that is useful at the end of your Ph.D. and not a throw-away
project.

To achieve this, I have attempted to open the MDS (Metadata Server)
disk partition as a file descriptor, locate the target file and its
xattr, and write a faulty value. However, I have encountered an
unexpected issue where my changes appear to be saved to memory and are
not being synchronized with the disk.

In general, this is also a good way to corrupt the filesystem.  If the xattr
is stored directly in the inode (as most of them are) then you will be
overwriting part of the live inode that is also cached in memory.  In many
cases, whatever was written directly to disk will be overwritten and lost
when the inode is flushed from memory.
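
For illustration, here is roughly what such a direct write looks like in
Python (the device path and byte offset below are made up -- finding the
real inode/xattr offset would mean parsing the ext4 on-disk layout
yourself):

    import os

    DEV = "/dev/sdb1"          # hypothetical MDT block device
    XATTR_OFFSET = 0x12345678  # hypothetical offset of the inline xattr

    fd = os.open(DEV, os.O_RDWR)
    try:
        os.pwrite(fd, b"corrupt", XATTR_OFFSET)
        os.fsync(fd)  # only flushes the block-device page cache
    finally:
        os.close(fd)

Even when the fsync() succeeds, the write only lands on the platter; it
does nothing to stop the in-memory inode from being written back over
that block later.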

Alternatively, if the inode is already in memory, the xattr will be read
from RAM (either from the client cache or from the MDS cache), so your
on-disk change is never seen at all.

If you create a large xattr it will be written to a separate block, which
would at least avoid massive filesystem corruption.
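
If you want to force that external-block case for your experiments,
something like the following should do it (the path, xattr name, and size
are examples only -- the exact inline-vs-external threshold depends on
the inode size):

    import os

    # A value too large for the in-inode xattr space ends up in a
    # separate xattr block on ext4/ldiskfs.
    os.setxattr("/mnt/lustre/testfile", "user.bigattr", b"A" * 2048)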

After completing the write operation, when I read the same xattr
again, it reflects the corrupted value. Strangely, when using the
"getfattr" command, the original, correct value is displayed. This
discrepancy has raised doubts about whether Lustre permits direct
modifications to its metadata on the disk.

The xattr contents are also cached on the client, and direct writes
to the storage would not invalidate that cache because they bypass
all of the proper access controls and locking.
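
You can see the discrepancy directly by comparing the two read paths
(all of the names and the offset below are hypothetical):

    import os

    # 1) Through the Lustre client (what getfattr does): may be served
    #    from the client or MDS cache under proper locking, so the
    #    original value appears.
    print(os.getxattr("/mnt/lustre/testfile", "user.testattr"))

    # 2) Raw read from the MDT device: returns whatever bytes were last
    #    written to that block, bypassing locking and cache coherency.
    fd = os.open("/dev/sdb1", os.O_RDONLY)
    print(os.pread(fd, 16, 0x12345678))
    os.close(fd)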

Furthermore, I observed that even after unmounting and remounting the
Lustre file system, the xattr continues to display the corrupted value
upon reading, whereas "getfattr" still returns the original, correct
value.

That really depends on how you modified the "xattr" and where "getfattr"
is actually getting the data from.  I suspect you aren't doing what you
think you are doing.

Please help me understand whether Lustre allows direct modifications
to its metadata on the disk and if there are any inherent limitations
or considerations that I should be aware of.

No, of course Lustre and ext4 do not "allow" this, just as no filesystem
"allows" you to run "dd if=/dev/zero of=/dev/sda1" and erase the data
from the partition.

Additionally, any recommendations or alternative approaches for
simulating faulty conditions for testing purposes would be highly
valuable to my research.

That really depends on what your research is trying to achieve.  Lustre
depends on reliable (RAID) storage underneath the MDT and OST.  It
is possible to use ldiskfs (ext4) or ZFS as underlying storage, and they
have different reliability vs. performance properties.  If you are testing
by directly corrupting on-disk storage then you are really testing those disk
filesystems, and Lustre does not add additional data redundancy layers
on top of them for metadata today, though there are *some* types of
internal metadata redundancy that can help recover from storage errors
(e.g. LFSCK can rebuild the Lustre file layout after errors on the MDT,
along with some types of directory breakage from the "link" xattr).
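
If the goal is fault injection rather than cache experiments, the sane
sequence is to take the target offline first, corrupt it, and then see
what the stack detects and repairs.  A rough sketch (the device,
mountpoint, and offset are assumptions; on ldiskfs you could equally use
debugfs(8) to locate and modify a specific inode):

    import os
    import subprocess

    # Unmount so nothing else holds the device or its caches.
    subprocess.run(["umount", "/mnt/mdt"], check=True)

    # Inject the fault while the filesystem is quiescent.
    fd = os.open("/dev/sdb1", os.O_RDWR)
    os.pwrite(fd, b"corrupt", 0x12345678)
    os.fsync(fd)
    os.close(fd)

    # Bring the MDT back and check what is detected/repaired, e.g. with
    # e2fsck offline or LFSCK online ("lctl lfsck_start -M <fsname>-MDT0000").
    subprocess.run(["mount", "-t", "lustre", "/dev/sdb1", "/mnt/mdt"],
                   check=True)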

ZFS should be able to withstand such data/metadata block corruption up
to a certain level without any errors, until it just refuses to work at all.
ldiskfs would *not* be able to handle outright corruption of the on-disk
data without errors (which is why you use RAID underneath it), but most
corruption would be localized and the filesystem would generally continue
to work (modulo the broken bits) even in the face of massive corruption.
Kind of like the difference between digital and analog audio signals.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud