[lustre-devel] RFC: Spill device for Lustre OSD
Andreas Dilger
adilger at ddn.com
Tue Nov 4 15:07:08 PST 2025
On Nov 4, 2025, at 14:11, Day, Timothy <timday at amazon.com> wrote:
VFS OSD won't give us everything to make it an OSD. Transactions are one of the issues, as Oleg mentioned.
We'll have to implement a lot of complexity in kernel space for
the design you're suggesting. At some point, implementing
a transaction log and delegating the rest to user space might
be the more maintainable option? Doing that via VFS/Fuse is
one option. Something similar to ublk for OSD could be another
option.
If the writes to the spill device are well managed, it may be possible
to do it without transaction support, so long as they are not exposed
directly for writing to the clients. Otherwise they essentially need
to be OSDs that expose transactions and recovery semantics.
I'm not saying this is preferable, but is this something you've
considered?
Extended attributes would be another thing, since not all file systems
support them.
I think it's fine not to accept filesystems that don't support EAs.
Or have the OSD advertise that it doesn't support EAs.
I don't think any storage system we care about today lacks
EAs, tags, or similar metadata that can be used for this.
deliver better performance than block-level cache with dmcache, where
recovery time is lengthy if the cache size is huge.
This seems speculative. Could you elaborate more on why you think
this is the case?
DMcache doesn't persist the bitmap used to indicate which blocks hold dirty data, so an ungraceful shutdown leads to a scan of the entire cache to determine which blocks are dirty.
This is a failing of DMcache rather than an indication that this
problem can't be solved on the block layer.
That was my original discussion with Jinshan as well. There was a
similar issue with mdraid, and they ended up with persistent bitmaps
in a flash device. It should be possible to manage this with logs
or bitmaps stored in the fast OSD device instead of the spill device.
Overall, I think the concept is interesting. It reminds me of how
Bcachefs handles multi-device support. Each device can be
designated as holding metadata or data replicas, and you
can control the promotion and migration between different
targets (all managed by a migration daemon). But this design is
too limited, IMHO. If we're going to accept the additional complexity
in the OSD, the solution has to be extensible. What if I want to
replicate to multiple targets? What if I want more than two tiers?
What if I want to transparently migrate data from one spill device to
another? We don't need this for the initial implementation, sure.
But these seem like natural extensions.
This is essentially replicating Lustre file layouts in the end, which
was my original suggestion - to use FLR and/or PCC-RO foreign
mirror layouts for this, even if it is not directly accessible from
clients. That avoids reimplementing tools/formats that already
exist in Lustre today for relatively little benefit.
I think we need to use some kind of common API for the
different devices. Even if the spill device doesn't support atomic
transactions, I don't see why we couldn't still use the common OSD
API and implement the migration daemon as a stacking driver on top
of that. The spill device OSD driver could be made to advertise that it
doesn't support atomic transactions and may not support EA. But we
get the added benefit of being able to use existing OSDs with this
feature, pretty much for free.
That is also my argument. Using the VFS directly is constraining (lack of
transactions), but for backends that _can_ be a full OSD (or already
have an OSD, like ldiskfs, ZFS, memfs) it is a drop-in replacement.
There is already an OSD API to query the functionality of the backing
storage, so it should be straightforward to add "transaction", "xattr",
and other supported features to that.
If we can implement a no-transaction osd-vfs, that would open up a
lot of flexibility for other use cases as well. Possibly the osd-vfs could
implement a journal or other logging layer internally to make up for the
lack of transactions, whether initially or at a later stage?
Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN