[lustre-devel] RFC: Spill device for Lustre OSD
Jinshan Xiong
jinshan.xiong at gmail.com
Mon Nov 3 15:57:52 PST 2025
> On Nov 3, 2025, at 13:59, Day, Timothy <timday at amazon.com> wrote:
>
>
>>
>> Spilling device is a private block device to an OSD in Lustre.
>
> Since we're accessing the spill device via VFS, is there any requirement
> that the spill device is actually a block device? This could just be any
> filesystem?
Technically, any file system. Initially it will only support HDDs.
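
Roughly, the OSD would copy spilled data through generic kernel VFS helpers, so anything mountable into the kernel namespace works. A minimal sketch of the write-out path (not from an actual patch; the path layout and function name below are mine):

#include <linux/err.h>
#include <linux/fs.h>

/* Sketch only: push one chunk of a spilled object into a regular file on
 * whatever file system is mounted as the spill target.  The on-disk path
 * layout (e.g. <spill_root>/O/<seq>/<oid>) is hypothetical. */
static int spill_write_chunk(const char *path, const void *buf,
                             size_t count, loff_t pos)
{
        struct file *filp;
        ssize_t written;

        filp = filp_open(path, O_WRONLY | O_CREAT | O_LARGEFILE, 0600);
        if (IS_ERR(filp))
                return PTR_ERR(filp);

        written = kernel_write(filp, buf, count, &pos);
        if (written >= 0 && (size_t)written != count)
                written = -EIO;         /* treat a short write as an error */

        /* Persist before the OSD drops its copy on the primary device. */
        if (written > 0)
                vfs_fsync(filp, 0);

        filp_close(filp, NULL);
        return written < 0 ? written : 0;
}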
>
>> allows an OSD to migrate infrequently accessed data objects to this
>> device. The key distinction from popular tiered storage is that it
>> only has a local view of the objects on the local OSD. Consequently, it
>> should make quick and accurate decisions based on the recent access
>> pattern. Only data objects can be spilled into the spilling device,
>> while metadata objects, such as the OI namespace and last objid, remain
>> in the OSD permanently. This implies that a spilling device is only
>> applicable to OST OSDs.
>
> We’d need to account for Data-on-Metadata as well. Technically, llogs are
> data objects too. But I doubt we'd want to migrate those off the primary
> storage.
>
>> Generic Lustre utilities won’t be able to directly access spilling
>> devices. For example, **lfs df** will only display the capacity of the
>> OSD device, while the capacity of the spilling device can be accessed
>> using specific options.
>
> I don't see why there has to be a separation between the spill device
> and the OSD device. I think it's a simpler experience for end-users
> if these details are handled transparently. Ideally, all of the tiering
> complexity would be handled below the OSD API boundary. Lustre itself
> i.e. "the upper layers" shouldn’t need to modified.
Yes, not touching the upper layers is one of the goals. By design, the spill device is a piece of private information belonging to that particular OSD; the rest of the Lustre stack won’t be able to see it.
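
In code terms (the names below are made up for illustration, not from a patch), the spill target would just be a field in the OSD's private device state, never surfaced through the dt/OSD API:

#include <linux/types.h>

/* Illustration only: the spill target lives in OSD-private state, so
 * nothing above the OSD API boundary can see or depend on it. */
struct osd_spill_info {
        char    *osi_root;      /* mount point of the spill file system */
        __u64    osi_reserved;  /* bytes to keep free on the primary    */
        bool     osi_enabled;
};

struct osd_device_sketch {
        /* ... existing ldiskfs/ZFS OSD fields ... */
        struct osd_spill_info od_spill; /* private; not reported upward */
};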
>
>> OSDs will access spilling devices through VFS interfaces in the kernel
>> space. Therefore, the spilling device must be mountable into the
>> kernel namespace. Initially, only HDD is supported, and the
>> implementation can be extended to S3 and GCS using s3fs and gcsfuse
>> respectively in the future.
>
> This isn't clear to me. Once we support HDD, what incremental work
> is needed to support s3fs/gcsfuse/etc.? As long as they present a
> normal filesystem API, they should all work the same?
>
> This raises the question: if we're already doing this work to support
> writing Lustre objects to any arbitrary filesystem via VFS and we're only
> intending to support OSTs with this proposal, why not implement
> an OST-only VFS OSD and handle tiering in the filesystem layer?
A VFS OSD won’t give us everything needed to make it a proper OSD. Transactions are one of the issues, as Oleg mentioned. Extended attributes would be another, since not all file systems support them.
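
To spell out the transaction point with placeholder names (this is not the real OSD API), a single OSD object write has to group the data, the OI index update and the Lustre xattrs so that they are atomic across a crash; plain VFS only gives independent system calls with no such grouping:

#include <linux/errno.h>
#include <linux/types.h>

struct spill_txn;       /* stands in for a journal/transaction handle */
struct spill_obj;       /* stands in for an OSD data object           */

struct spill_txn *spill_txn_start(void);
int  spill_txn_stop(struct spill_txn *th);
void spill_obj_write(struct spill_obj *o, const void *buf, size_t len,
                     loff_t pos, struct spill_txn *th);
void spill_oi_insert(struct spill_obj *o, struct spill_txn *th);
void spill_xattr_set(struct spill_obj *o, const char *name,
                     struct spill_txn *th);

/* What an OSD needs: all three updates land, or none of them do. */
int osd_style_object_write(struct spill_obj *o, const void *buf,
                           size_t len, loff_t pos)
{
        struct spill_txn *th = spill_txn_start();

        if (!th)
                return -ENOMEM;

        spill_obj_write(o, buf, len, pos, th);  /* file data              */
        spill_oi_insert(o, th);                 /* FID -> inode mapping   */
        spill_xattr_set(o, "trusted.fid", th);  /* Lustre object metadata */

        return spill_txn_stop(th);              /* atomic w.r.t. recovery */
}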
>
> The kernel component would be much simpler. And we'd be able to
> support a lot more complexity in user space with FUSE.
>
> If we still wanted to handle the tiering/migration in kernel space,
> then a VFS OSD would allow us to do that purely using the OSD
> APIs, similar to what Andreas was suggesting.
>
>> There’s no on-demand migration feature. If there’s no available space
>> while a large chunk of data is being written, the system will return
>> ENOSPC to the client. This simplifies the granting mechanism on the
>> OFD significantly. Another reason for this approach is that writing
>> performance is more predictable. Otherwise, the system might have to
>> wait for an object to be migrated on demand, leading to much higher
>> latency.
>
> I think users would tolerate slow jobs more than write failures because
> the OSD isn't smart enough to write to the empty spill device.
The only reason users choose Lustre is performance. If a misconfiguration could lead to 100 seconds of latency, that is definitely not what they would expect. This product is not designed for that use case.
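
To make that concrete, the space check on the write path would be deliberately simple (sketch with made-up names): if the primary device is full, fail fast instead of stalling the write behind a copy-out to the spill device.

#include <linux/errno.h>
#include <linux/types.h>

/* Sketch only: no on-demand migration in the write path.  Failing fast
 * keeps write latency predictable; the background migrator is the only
 * thing that moves data to the spill device. */
static int osd_spill_check_space(__u64 primary_free_bytes, __u64 write_bytes)
{
        if (primary_free_bytes >= write_bytes)
                return 0;       /* fast path: data stays on the flash OSD */

        return -ENOSPC;         /* never wait for a migration here */
}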
>
>> ## Conclusion
>>
>> This proposal tries to address the pain points of the prevalent
>> tiered-storage approach based on mirroring, where a full-fledged policy
>> engine has to run on dedicated clients, and to
>
> The client still has to do some manual management? i.e. you can't
> write indefinitely - the client would have to periodically force a write
> to the spill device or it might hit ENOSPC? This is better than the
> pain of manually managing mirrors, but it's not a complete solution.
The daemon running on the OSD should be able to move cold data out to the spill device and free up space. The client won’t do anything.
If a user’s workload hits ENOSPC, that implies that the size of their working set exceeds the size of their Lustre instance.
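
For clarity, the daemon would be a per-OSD kernel thread along these lines (all names and thresholds are illustrative):

#include <linux/kthread.h>
#include <linux/sched.h>

struct spill_obj;                       /* placeholder for an OSD object  */
struct spill_ctx {                      /* placeholder for per-OSD state  */
        unsigned int    high_watermark; /* % used that triggers eviction  */
        long            scan_interval;  /* jiffies between scans          */
};

unsigned int primary_used_percent(struct spill_ctx *ctx);
struct spill_obj *pick_coldest_object(struct spill_ctx *ctx);
void migrate_to_spill(struct spill_ctx *ctx, struct spill_obj *obj);

/* Keep free space on the primary device above a watermark by pushing the
 * coldest objects out to the spill file system; clients never wait on it. */
static int spill_migrator_main(void *arg)
{
        struct spill_ctx *ctx = arg;

        while (!kthread_should_stop()) {
                while (primary_used_percent(ctx) > ctx->high_watermark) {
                        struct spill_obj *cold = pick_coldest_object(ctx);

                        if (!cold)
                                break;  /* nothing cold enough to evict */
                        migrate_to_spill(ctx, cold);
                }
                schedule_timeout_interruptible(ctx->scan_interval);
        }
        return 0;
}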
>
>> deliver better performance than a block-level cache such as dm-cache,
>> where recovery time is lengthy if the cache size is huge.
>
> This seems speculative. Could you elaborate more on why you think
> this is the case?
dm-cache doesn’t persist the bitmap that indicates which blocks hold dirty data, so an ungraceful shutdown leads to a scan of the entire cache to determine which blocks are dirty.
>