[lustre-devel] RFC: Spill device for Lustre OSD

Tue Nov 4 15:42:40 PST 2025

> On Nov 4, 2025, at 15:07, Andreas Dilger <adilger at ddn.com> wrote:
> 
> On Nov 4, 2025, at 14:11, Day, Timothy <timday at amazon.com> wrote:
>> 
>>> VFS OSD won’t give us everything to make it an OSD. Transaction is one of the issue as Oleg mentioned.
>> 
>> We'll have to implement a lot of complexity in kernel space for
>> the design you're suggesting. At some point, implementing
>> a transaction log and delegating the rest to user space might
>> be the more maintainable option? Doing that via VFS/Fuse is
>> one option. Something similar to ublk for OSD could be another
>> option.
> 
> If the writes to the spill device are well managed, it may be possible
> to do it without transaction support, so long as they are not exposed
> directly for writing to the clients.  Otherwise they essentially need
> to be OSDs that expose transactions and recovery semantics.
> 
>> I'm not saying this is preferrable, but is this something you've
>> considered?
>> 
>>> Extended attributes would be another thing since not all file systems would support it.
>> 
>> I think it's fine not accept filesystems that don't support EA.
>> Or have the OSD advertise that it doesn't support EA.
> 
> I don't think any storage system we care about today does not support
> EAs or tags or similar metadata that can be used for this.
> 
>>>>> deliver better performance than block-level cache with dmcache, where
>>>>> recovery time is lengthy if the cache size is huge.
>>>> 
>>>> This seems speculative. Could you elaborate more on why you think
>>>> this is the case?
>>> 
>>> DMcache doesn’t persist the bitmap used to indicate which blocks are holding dirty data, so an ungraceful shutdown will lead to a scanning of the entire cache in order to determine which blocks are > dirty.
>> 
>> This is a failing of DMcache rather than an indication that this
>> problem can't be solved on the block layer.
> 
> That was my original discussion with Jinshan as well.  There was a
> similar issue with mdraid, and they ended up with persistent bitmaps
> in a flash device.  It should be possible to manage this with logs
> or bitmaps stored in the fast OSD device instead of the spill device.
> 
>> Overall, I think the concept is interesting. It reminds me of how
>> Bcachefs handle multi-device support. Each device can be
>> designated as holding metadata or data replicas. And you
>> can control the promotion and migration between different
>> targets (all managed by a migration daemon). But this design is
>> too limited, IMHO. If we're going to accept the additional complexity
>> in the OSD, the solution has to be extensible. What if I want to
>> replicate to multiple targets? What if I want more than two tiers?
>> What if I want to transparently migrate data from one spill device to
>> another? We don't need this for the initial implementation, sure.
>> But these seem like natural extensions.

It’s possible to extend the design to have multiple spill devices in the OSD; you could have two spill devices and mirror them, or raid0 to make a larger device. I don’t see the design would not allow you to do that. 

> 
> This is essentially replicating Lustre file layouts in the end, which
> was my original suggestion - to use FLR and/or PCC-RO foreign
> mirror layouts for this, even if it is not directly accessible from
> clients.  That avoids reimplementing tools/formats that already
> exist in Lustre today for relatively little benefit.

One of the goals is to not have a file system-level scanner, which is not good. Otherwise, we can just use FLR-based tiered storage.

> 
>> I think we need to use some kind of common API for the
>> different devices. Even if the spill device doesn't support atomic
>> transactions, I don't see why we couldn't still use the common OSD
>> API and implement the migration daemon as a stacking driver on-top
>> of that. The spill device OSD driver could be made to advertise that it
>> doesn't support atomic transactions and may not support EA. But we
>> get the added benefit of being able to use existing OSDs with this
>> feature, pretty much for free.
> 
> Also my argument.  Using the VFS directly is constraining (lack of
> transactions), but of backends that _can_ be a full OSD (or already
> have an OSD like ldiskfs, ZFS, memfs) it is a drop-in replacement.
> 
> There is already an OSD API to query the functionality of the backing
> storage, so it should be straight forward to add "transaction", "xattr",
> and other supported features to that.
> 
> If we can implement a no-transaction osd-vfs, that would expose a
> lot of flexibility for other reasons as well.  Possibly the osd-vfs could
> implement a journal or other logging layer internally to make up for
> lack of transactions, whether initially or at a later stage?

What would be the benefit of having a limited OSD in the stack? I don’t have a strong opinion of not doing it, but I just didn’t see any benefits of doing it.

> 
> Cheers, Andreas
> —
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud/DDN
> 
> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20251104/a02c695e/attachment-0001.htm>