[lustre-devel] RFC: Spill device for Lustre OSD

Day, Timothy timday at amazon.com
Wed Nov 5 09:40:04 PST 2025


>>>Overall, I think the concept is interesting. It reminds me of how
>>>Bcachefs handles multi-device support. Each device can be
>>>designated as holding metadata or data replicas. And you
>>>can control the promotion and migration between different
>>>targets (all managed by a migration daemon). But this design is
>>>too limited, IMHO. If we're going to accept the additional complexity
>>>in the OSD, the solution has to be extensible. What if I want to
>>>replicate to multiple targets? What if I want more than two tiers?
>>>What if I want to transparently migrate data from one spill device to
>>>another? We don't need this for the initial implementation, sure.
>>>But these seem like natural extensions.
>
>It’s possible to extend the design to have multiple spill devices in the OSD; you could mirror two spill devices, or raid0 them to make a larger device. I don’t see how the design would prevent you from doing that.

I think I'm more concerned with the terminology and framing of the
design. The mental model of the design is:

Here's an OSD. It can grow one (or more) spill devices where you can offload block data.

But I think we should think about it like:

You can have a single device OSD or a multi-device (or pool) OSD. Each device
in the pool can be assigned a role in the pool. You define a policy for what
devices a write has to land on before it's committed. You define what devices
are used for caching reads. You define migration policies. You define where the
pool configuration and metadata would live. etc. etc. The configuration you lay
out in the design would be the first supported configuration of a larger
multi-device feature.
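To make that framing concrete, here is a minimal sketch of per-device roles and a write-commit policy for a multi-device OSD pool. All of these names and structures are hypothetical, invented for illustration; nothing like this exists in Lustre today:

```c
#include <assert.h>

/* Hypothetical roles a device can play in a multi-device OSD pool.
 * Purely illustrative; not existing Lustre code. */
enum osd_dev_role {
	ODR_PRIMARY	= 1 << 0,	/* holds authoritative metadata */
	ODR_SPILL	= 1 << 1,	/* overflow target for block data */
	ODR_READ_CACHE	= 1 << 2,	/* caches reads, never authoritative */
};

/* A write-commit policy: how many devices of a given role must
 * acknowledge a write before it is considered committed. */
struct osd_write_policy {
	enum osd_dev_role	owp_role;	/* devices this applies to */
	int			owp_min_copies;	/* copies required on them */
};

/* The single-spill-device configuration from the original design,
 * expressed as just the first supported instance of the general
 * multi-device model: one primary, one spill, one copy on each. */
static const struct osd_write_policy spill_v1_policy[] = {
	{ .owp_role = ODR_PRIMARY, .owp_min_copies = 1 },
	{ .owp_role = ODR_SPILL,   .owp_min_copies = 1 },
};
```

Mirrored spill devices or extra tiers would then just be different policy tables, not new mechanisms.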

That's a more cohesive way of thinking about this, IMO.

>>>I think we need to use some kind of common API for the
>>>different devices. Even if the spill device doesn't support atomic
>>>transactions, I don't see why we couldn't still use the common OSD
>>>API and implement the migration daemon as a stacking driver on-top
>>>of that. The spill device OSD driver could be made to advertise that it
>>>doesn't support atomic transactions and may not support EA. But we
>>>get the added benefit of being able to use existing OSDs with this
>>>feature, pretty much for free.
>>
>>Also my argument.  Using the VFS directly is constraining (lack of
>>transactions), but of backends that _can_ be a full OSD (or already
>>have an OSD like ldiskfs, ZFS, memfs) it is a drop-in replacement.
>>
>>There is already an OSD API to query the functionality of the backing
>>storage, so it should be straightforward to add "transaction", "xattr",
>>and other supported features to that.
>>
>>If we can implement a no-transaction osd-vfs, that would expose a
>>lot of flexibility for other reasons as well.  Possibly the osd-vfs could
>>implement a journal or other logging layer internally to make up for
>>lack of transactions, whether initially or at a later stage?
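As a sketch of the feature-advertising idea: a stacking migration driver could check the backend's advertised capabilities and compensate (e.g. with an internal journal) when transactions are missing. The feature bits and structure below are assumptions for illustration, not the actual OSD API, which exposes backend properties differently:

```c
#include <stdbool.h>

/* Hypothetical capability bits an OSD backend could advertise.
 * Illustrative only; the real OSD API differs. */
#define OSD_FEAT_TRANSACTIONS	(1u << 0)	/* atomic transactions */
#define OSD_FEAT_XATTR		(1u << 1)	/* extended attributes */

struct osd_backend {
	unsigned int ob_features;	/* bitmask of OSD_FEAT_* */
};

/* A stacking driver queries the advertised features instead of
 * assuming every backend supports transactions; a no-transaction
 * osd-vfs would simply leave OSD_FEAT_TRANSACTIONS clear. */
static bool osd_needs_internal_journal(const struct osd_backend *ob)
{
	return !(ob->ob_features & OSD_FEAT_TRANSACTIONS);
}
```

With this shape, osd-ldiskfs/osd-zfs advertise full support and an osd-vfs advertises a reduced set, but everything sits behind the same API.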
>
>What would be the benefit of having a limited OSD in the stack? I don’t have a strong objection to doing it, but I just don’t see any benefit to it.

Flipping the question around: what's the benefit of having multiple APIs
for talking to underlying storage? Using a common API allows us to use
osd-ldiskfs/osd-zfs for the spill device. Or use the osd-vfs for MGS. And
the OSD API is stable and well documented.
