[lustre-devel] RFC: Spill device for Lustre OSD

Day, Timothy timday at amazon.com
Tue Nov 4 13:11:39 PST 2025


>>> Generic Lustre utilities won’t be able to directly access spilling
>>> devices. For example, **lfs df** will only display the capacity of the
>>> OSD device, while the capacity of the spilling device can be accessed
>>> using specific options.
>>
>> I don't see why there has to be a separation between the spill device
>> and the OSD device. I think it's a simpler experience for end-users
>> if these details are handled transparently. Ideally, all of the tiering
>> complexity would be handled below the OSD API boundary. Lustre itself
>> i.e. "the upper layers" shouldn’t need to be modified.
>
>Yeah, not touching the upper layers is one of the goals. By design, the spill device is a piece of information private to that particular OSD; the rest of the Lustre stack won’t be able to see it.

We would still somehow expose the existence of the spill device
via `lfs df`, so the statfs from client to server would have to change?
And you mention some difficulty supporting s3fs/gcsfuse
with OFD. So you anticipate at least some upper layer changes?
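For the capacity part, at least, I'd hope the OSD could simply fold the
spill device into the statfs it already returns, so nothing on the wire
changes. A rough sketch of what I mean; osd_statfs_backend() and
osd_spill_path() are invented names, not existing Lustre symbols:

        /*
         * Sketch only: report the spill device's capacity as part of the
         * obd_statfs the OSD already returns, so `lfs df` and OST_STATFS
         * stay unchanged.
         */
        static int osd_spill_statfs(const struct lu_env *env,
                                    struct dt_device *dev,
                                    struct obd_statfs *osfs)
        {
                struct kstatfs spill;
                int rc;

                rc = osd_statfs_backend(env, dev, osfs); /* normal OSD statfs */
                if (rc)
                        return rc;

                rc = vfs_statfs(osd_spill_path(dev), &spill); /* spill fs via VFS */
                if (rc)
                        return rc;

                /* Scale the spill device's blocks to the block size the
                 * OSD already reports before adding them in. */
                osfs->os_blocks += spill.f_blocks * spill.f_bsize / osfs->os_bsize;
                osfs->os_bfree  += spill.f_bfree  * spill.f_bsize / osfs->os_bsize;
                osfs->os_bavail += spill.f_bavail * spill.f_bsize / osfs->os_bsize;

                return 0;
        }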

>>> OSDs will access spilling devices through VFS interfaces in the kernel
>>> space. Therefore, the spilling device must be mountable into the
>>> kernel namespace. Initially, only HDD is supported, and the
>>> implementation can be extended to S3 and GCS using s3fs and gcsfuse
>>> respectively in the future.
>>
>> This isn't clear to me. Once we support HDD, what incremental work
>> is needed to support s3fs/gcsfuse/etc? As long as they present a
>> normal filesystem API, they should all work the same?
>>
>> This begs the question: if we're already doing this work to support
>> writing Lustre objects to any arbitrary filesystem via VFS and we're only
>> intending to support OSTs with this proposal, why not implement
>> an OST-only VFS OSD and handle tiering in the filesystem layer?
>
>VFS OSD won’t give us everything to make it an OSD. Transactions are one of the issues, as Oleg mentioned.

We'll have to implement a lot of complexity in kernel space for
the design you're suggesting. At some point, implementing
a transaction log in the kernel and delegating the rest to user space
might be the more maintainable option. Doing that via VFS/FUSE is
one approach; something similar to ublk, but for OSDs, could be another.
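For example, the kernel piece might be little more than an append-only
log of self-describing records, with a user-space agent replaying them
against the backing store. Everything below is invented purely to make
the idea concrete:

        /* Hypothetical on-disk record for such a log; none of these
         * names exist in Lustre today. */
        struct spill_wal_rec {
                __u64           swr_transno;    /* Lustre transaction number */
                struct lu_fid   swr_fid;        /* object the record applies to */
                __u32           swr_op;         /* create/write/punch/destroy/setxattr */
                __u32           swr_flags;
                __u64           swr_offset;     /* file offset for write/punch */
                __u32           swr_len;        /* payload length after the header */
                __u8            swr_data[];     /* inline payload */
        };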

I'm not saying this is preferable, but is this something you've
considered?

>  Extended attributes would be another thing, since not all file systems support them.

I think it's fine not to accept filesystems that don't support EAs. Or have the
OSD advertise that it doesn't support EAs.
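Something along the lines of the existing dt_conf_get() path seems like
the natural place to advertise that, assuming callers can cope with a
zero EA size (which I haven't checked):

        /* Sketch only: a spill OSD reporting that it has no xattr
         * support at all. */
        static void spill_osd_conf_get(const struct lu_env *env,
                                       const struct dt_device *dev,
                                       struct dt_device_param *param)
        {
                /* Treating zero as "no EAs" is an assumption on my part
                 * about how this would be consumed. */
                param->ddp_max_ea_size = 0;
        }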

>>> There’s no on-demand migration feature. If there’s no available space
>>> while a large chunk of data is being written, the system will return
>>> ENOSPC to the client. This simplifies the granting mechanism on the
>>> OFD significantly. Another reason for this approach is that writing
>>> performance is more predictable. Otherwise, the system might have to
>>> wait for an object to be migrated on demand, leading to much higher
>>> latency.
>>
>> I think users would tolerate slow jobs more than write failures because
>> the OSD isn't smart enough to write to the empty spill device.
>
>The only reason users choose Lustre is performance. If a misconfiguration leads to 100 seconds of latency, that is definitely not what they would expect. This product is not designed for that use case.

So if you want to write faster than the migration daemon can
free space, you have to accept a write failure? I can't really
think of a workload where this is preferable. I think this
behavior should be tunable, at least.
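To make that concrete, this is the kind of knob I'm imagining. All of
the names below are invented; nothing like this exists today:

        /* Hypothetical per-OSD policy for writes that can't fit on the
         * fast tier. */
        enum spill_enospc_policy {
                SPILL_ENOSPC_FAIL,      /* the current proposal: return -ENOSPC */
                SPILL_ENOSPC_WAIT,      /* block until migration frees space */
        };

        static int spill_reserve_space(struct osd_device *osd, __u64 bytes)
        {
                if (spill_space_available(osd, bytes))
                        return 0;

                if (osd->od_spill_enospc_policy == SPILL_ENOSPC_FAIL)
                        return -ENOSPC;

                /* Kick the migration daemon and wait, bounded by a timeout
                 * so a stuck daemon still surfaces as an error eventually. */
                spill_migration_kick(osd);
                return spill_wait_for_space(osd, bytes, SPILL_ENOSPC_TIMEOUT);
        }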

>>> deliver better performance than block-level cache with dmcache, where
>>> recovery time is lengthy if the cache size is huge.
>>
>> This seems speculative. Could you elaborate more on why you think
>> this is the case?
>
> DMcache doesn’t persist the bitmap used to indicate which blocks are holding dirty data, so an ungraceful shutdown will lead to a scan of the entire cache in order to determine which blocks are dirty.

This is a failing of DMcache rather than an indication that this
problem can't be solved on the block layer.

Overall, I think the concept is interesting. It reminds me of how
Bcachefs handles multi-device support. Each device can be
designated as holding metadata or data replicas. And you
can control the promotion and migration between different
targets (all managed by a migration daemon). But the proposed design
seems too limited, IMHO. If we're going to accept the additional complexity
in the OSD, the solution has to be extensible. What if I want to
replicate to multiple targets? What if I want more than two tiers?
What if I want to transparently migrate data from one spill device to
another? We don't need this for the initial implementation, sure.
But these seem like natural extensions.

I think we need to use some kind of common API for the
different devices. Even if the spill device doesn't support atomic
transactions, I don't see why we couldn't still use the common OSD
API and implement the migration daemon as a stacking driver on top
of that. The spill device OSD driver could be made to advertise that it
doesn't support atomic transactions and may not support EAs. But we
get the added benefit of being able to use existing OSDs with this
feature, pretty much for free.
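The rough shape I have in mind, with every name below invented for
illustration: a tiering OSD that is itself a dt_device, holding
references to the lower devices and routing each object to whichever
tier currently owns it, so the migration daemon only ever speaks the
common OSD API:

        struct tier_device {
                struct dt_device         td_dt;         /* what upper layers see */
                struct dt_device        *td_fast;       /* existing ldiskfs/ZFS OSD */
                struct dt_device        *td_spill;      /* HDD, s3fs, gcsfuse, ... */
                __u32                    td_flags;      /* e.g. "spill lacks xattrs" */
        };

        struct tier_object {
                struct dt_object         to_dt;         /* object exposed upward */
                struct dt_object        *to_resident;   /* tier the data lives on now */
        };

        /* Every dt operation on a tier_object would be forwarded to
         * to_resident; migration swaps the pointer once an object moves. */
        static struct dt_object *tier_resident(struct tier_object *obj)
        {
                return obj->to_resident;
        }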


