[lustre-devel] RFC: Spill device for Lustre OSD
Jinshan Xiong
jinshan.xiong at gmail.com
Tue Nov 4 15:30:48 PST 2025
> On Nov 4, 2025, at 13:11, Day, Timothy <timday at amazon.com> wrote:
>
>>>> Generic Lustre utilities won’t be able to directly access spilling
>>>> devices. For example, **lfs df** will only display the capacity of the
>>>> OSD device, while the capacity of the spilling device can be accessed
>>>> using specific options.
>>>
>>> I don't see why there has to be a separation between the spill device
>>> and the OSD device. I think it's a simpler experience for end-users
>>> if these details are handled transparently. Ideally, all of the tiering
>>> complexity would be handled below the OSD API boundary. Lustre itself,
>>> i.e. "the upper layers", shouldn't need to be modified.
>>
>> Yeah, not touching the upper layers is one of the goals. By design, the spill device is a piece of private state of that particular OSD; the rest of the Lustre stack won't be able to see it.
>
> We would somehow expose the existence of the spill device
> via `lfs df`, so the stat from client to server would have to change?
> And you mention some difficulty supporting s3fs/gcsfuse
> with OFD. So you anticipate at least some upper layer changes?
Yeah, we need to change that, but I'd expect the change to be minimal and limited to statfs, mostly independent of the rest of the Lustre stack.
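
To be concrete, on the OSD side I'd expect something roughly like the sketch below (signature simplified; osd_primary_statfs(), osd_spill_kstatfs(), od_spill_mnt and osd_dt_dev() are made-up names here, just for illustration):

#include <linux/statfs.h>

static int osd_statfs(const struct lu_env *env, struct dt_device *d,
		      struct obd_statfs *sfs)
{
	struct osd_device *osd = osd_dt_dev(d);	/* hypothetical helper */
	struct kstatfs spill;
	int rc;

	rc = osd_primary_statfs(osd, sfs);	/* existing ldiskfs/ZFS path */
	if (rc != 0)
		return rc;

	if (osd->od_spill_mnt != NULL) {	/* hypothetical member */
		/* osd_spill_kstatfs() would just wrap vfs_statfs() on
		 * the mounted spill fs; block sizes assumed equal here */
		rc = osd_spill_kstatfs(osd, &spill);
		if (rc == 0) {
			sfs->os_blocks += spill.f_blocks;
			sfs->os_bfree  += spill.f_bfree;
			sfs->os_bavail += spill.f_bavail;
		}
	}
	return rc;
}

Whether the spill capacity is always folded in, or only reported when the client asks for it with a new statfs flag, is a policy decision on top of this.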
>
>>>> OSDs will access spilling devices through VFS interfaces in the kernel
>>>> space. Therefore, the spilling device must be mountable into the
>>>> kernel namespace. Initially, only HDD is supported, and the
>>>> implementation can be extended to S3 and GCS using s3fs and gcsfuse
>>>> respectively in the future.
>>>
>>> This isn't clear to me. Once we support HDD, what incremental work
> is needed to support s3fs/gcsfuse/etc.? As long as they present a
>>> normal filesystem API, they should all work the same?
>>>
>>> This begs the question: if we're already doing this work to support
>>> writing Lustre objects to any arbitrary filesystem via VFS and we're only
>>> intending to support OSTs with this proposal, why not implement
>>> an OST-only VFS OSD and handle tiering in the filesystem layer?
>>
>> A VFS OSD won't give us everything needed to make it a full OSD. Transactions are one of the issues, as Oleg mentioned.
>
> We'll have to implement a lot of complexity in kernel space for
> the design you're suggesting. At some point, implementing
> a transaction log and delegating the rest to user space might
> be the more maintainable option? Doing that via VFS/Fuse is
> one option. Something similar to ublk for OSD could be another
> option.
Not at all. It should be simple and straightforward. Mostly we just call the vfs_*() interfaces of the underlying file system. All the complexity stays in the primary OSD, e.g. using llog to avoid object leakage on the spill device.
Yeah, it's possible to upcall into user-space interfaces at a later stage.
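
To give a sense of the scale, pushing one object's data out to the spill fs is roughly the following (untested sketch, error handling trimmed; the path would be derived from the FID, and real code would record the operation in llog first so a crash cannot leak the spill-side object):

#include <linux/fs.h>

static int osd_spill_write_object(const char *path, const void *buf,
				  size_t count)
{
	struct file *filp;
	loff_t pos = 0;
	ssize_t written;
	int rc = 0;

	filp = filp_open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (IS_ERR(filp))
		return PTR_ERR(filp);

	written = kernel_write(filp, buf, count, &pos);
	if (written < 0)
		rc = written;
	else if ((size_t)written != count)
		rc = -EIO;

	if (rc == 0)
		rc = vfs_fsync(filp, 0);	/* make it durable */

	filp_close(filp, NULL);
	return rc;
}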
>
> I'm not saying this is preferable, but is this something you've
> considered?
>
>> Extended attributes would be another thing, since not all file systems support them.
>
> I think it's fine to not accept filesystems that don't support EAs. Or have
> the OSD advertise that it doesn't support EAs.
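Right, that can be probed when the spill fs is mounted, so the OSD can advertise EA support (or the lack of it) up front. A minimal sketch, assuming a hypothetical od_spill_xattr flag:

#include <linux/fs.h>
#include <linux/mount.h>

static void osd_spill_probe_xattr(struct osd_device *osd,
				  struct vfsmount *mnt)
{
	/* a filesystem that supports xattrs registers handlers on its
	 * superblock; s3fs/gcsfuse style backends may not */
	osd->od_spill_xattr = (mnt->mnt_sb->s_xattr != NULL);

	if (!osd->od_spill_xattr)
		CWARN("spill fs lacks xattr support, EAs will be unavailable there\n");
}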
>
>>>> There’s no on-demand migration feature. If there’s no available space
>>>> while a large chunk of data is being written, the system will return
>>>> ENOSPC to the client. This simplifies the granting mechanism on the
>>>> OFD significantly. Another reason for this approach is that writing
>>>> performance is more predictable. Otherwise, the system might have to
>>>> wait for an object to be migrated on demand, leading to much higher
>>>> latency.
>>>
>>> I think users would tolerate slow jobs more than write failures because
>>> the OSD isn't smart enough to write to the empty spill device.
>>
>> The only reason users run Lustre is performance. If a misconfiguration
>> could lead to 100 seconds of latency, that is definitely not what they
>> would expect. This product is not designed for that use case.
>
> So if you want to write faster than the migration daemon can
> free space, you have to accept a write failure? I can't really
> think of a workload where this is preferable. I think this
> behavior should be tunable, at least.
Yeah, it's possible to make this a tunable option. The key point here is that the primary OSDs still define the size of your Lustre instance, and they should be large enough to hold your entire local working set.
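
Something as small as this would do for a first cut (sketch; the parameter name is made up, and the real knob would more likely be an lprocfs tunable):

#include <linux/module.h>

/* false: return -ENOSPC immediately (default, predictable latency);
 * true:  block the writer until the migration daemon frees space */
static bool spill_wait_on_enospc;
module_param(spill_wait_on_enospc, bool, 0644);
MODULE_PARM_DESC(spill_wait_on_enospc,
		 "block writers until migration frees space instead of returning -ENOSPC");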
>
>>>> deliver better performance than block-level cache with dmcache, where
>>>> recovery time is lengthy if the cache size is huge.
>>>
>>> This seems speculative. Could you elaborate more on why you think
>>> this is the case?
>>
>> DMcache doesn't persist the bitmap used to indicate which blocks are
>> holding dirty data, so an ungraceful shutdown will lead to a scan of the
>> entire cache in order to determine which blocks are dirty.
>
> This is a failing of DMcache rather than an indication that this
> problem can't be solved on the block layer.
>
> Overall, I think the concept is interesting. It reminds me of how
> Bcachefs handles multi-device support. Each device can be
> designated as holding metadata or data replicas. And you
> can control the promotion and migration between different
> targets (all managed by a migration daemon). But this design is
> too limited, IMHO. If we're going to accept the additional complexity
> in the OSD, the solution has to be extensible. What if I want to
> replicate to multiple targets? What if I want more than two tiers?
> What if I want to transparently migrate data from one spill device to
> another? We don't need this for the initial implementation, sure.
> But these seem like natural extensions.
>
> I think we need to use some kind of common API for the
> different devices. Even if the spill device doesn't support atomic
> transactions, I don't see why we couldn't still use the common OSD
> API and implement the migration daemon as a stacking driver on-top
> of that. The spill device OSD driver could be made to advertise that it
> doesn't support atomic transactions and may not support EA. But we
> get the added benefit of being able to use existing OSDs with this
> feature, pretty much for free.
>
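For the capability question, something like a per-device flags word in the OSD/dt API would be enough for a stacking driver to discover what each lower target supports. All names below are invented, just to show the shape:

enum osd_capa {
	OSD_CAPA_TRANSACTIONS	= 1 << 0,	/* atomic transactions */
	OSD_CAPA_XATTR		= 1 << 1,	/* extended attributes */
	OSD_CAPA_GRANTS		= 1 << 2,	/* space grants */
};

/* A VFS/spill OSD would register without OSD_CAPA_TRANSACTIONS and set
 * OSD_CAPA_XATTR only when the backing fs has xattr handlers; the
 * tiering layer checks the flags before delegating an operation. */
static inline bool osd_supports(const struct osd_device *osd, u32 capa)
{
	return (osd->od_capa & capa) == capa;	/* od_capa: hypothetical */
}

Whether that extensibility is worth the extra indirection in the first version is exactly the tradeoff we are discussing.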