[lustre-devel] RFC: Spill device for Lustre OSD

Day, Timothy timday at amazon.com
Mon Nov 3 13:59:06 PST 2025


> Spilling device is a private block device to an OSD in Lustre.

Since we're accessing the spill device via VFS, is there any requirement
that the spill device is actually a block device? This could just be any
filesystem?
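
For illustration, a rough sketch of what I mean (the helper name is
made up; this assumes the spill target is just a path on some
already-mounted filesystem). Nothing in the plain VFS path below
cares whether a block device sits underneath:

#include <linux/fs.h>
#include <linux/err.h>

/* Hypothetical: push one object's data out to a path on the (already
 * mounted) spill filesystem using only generic VFS calls. */
static int spill_write_object(const char *spill_path,
                              const void *buf, size_t len)
{
        struct file *filp;
        loff_t pos = 0;
        ssize_t rc;

        filp = filp_open(spill_path, O_WRONLY | O_CREAT | O_LARGEFILE, 0600);
        if (IS_ERR(filp))
                return PTR_ERR(filp);

        rc = kernel_write(filp, buf, len, &pos);
        filp_close(filp, NULL);

        return rc < 0 ? rc : 0;
}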

> allows an OSD to migrate infrequently accessed data objects to this
> device. The key distinction between popular tiered storage is that it
> only has a local view of the objects in a local OSD. Consequently, it
> should make quick and accurate decisions based on the recent access
> pattern. Only data objects can be spilled into the spilling device,
> while metadata objects, such as OI namespace and last objid, remain in
> the OSD permanently. This implies that the spilling device for OSD is
> only applicable to OST OSDs.

We'd need to account for Data-on-MDT (DoM) as well. Technically, llogs
are data objects too, but I doubt we'd want to migrate those off the
primary storage.
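
If we go this route, the OSD would presumably want an explicit
eligibility check so llogs and other internal objects never leave the
primary device. Something along these lines (fid_is_llog() and
fid_is_internal() are stand-ins for whatever predicates we'd actually
use; fid_seq_is_norm()/fid_seq() are the existing FID helpers):

/* Hypothetical eligibility check, assuming the usual lustre_fid.h
 * helpers: only ordinary data objects are candidates for the spill
 * device; llog and other internal objects stay on primary storage. */
static bool spill_object_eligible(const struct lu_fid *fid)
{
        if (fid_is_llog(fid) || fid_is_internal(fid))
                return false;

        /* Normal-sequence FIDs are regular (data) objects. */
        return fid_seq_is_norm(fid_seq(fid));
}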

> Generic Lustre utilities won’t be able to directly access spilling
> devices. For example, **lfs df** will only display the capacity of the
> OSD device, while the capacity of the spilling device can be accessed
> using specific options.

I don't see why there has to be a separation between the spill device
and the OSD device. I think it's a simpler experience for end-users
if these details are handled transparently. Ideally, all of the tiering
complexity would be handled below the OSD API boundary. Lustre itself,
i.e. "the upper layers", shouldn't need to be modified.
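
To make that concrete, the OSD's own statfs path could simply fold the
spill filesystem's capacity into what it reports upward, so lfs df and
the grant code keep seeing a single device. Purely illustrative; the
spill numbers are assumed to come from a vfs_statfs() on the spill
mount, converted into an obd_statfs:

/* Hypothetical: merge the spill filesystem's capacity into the
 * obd_statfs the OSD reports, scaled to the primary block size, so
 * the upper layers only ever see one device. */
static void osd_statfs_merge_spill(struct obd_statfs *osfs,
                                   const struct obd_statfs *spill)
{
        osfs->os_blocks += spill->os_blocks * spill->os_bsize / osfs->os_bsize;
        osfs->os_bfree  += spill->os_bfree  * spill->os_bsize / osfs->os_bsize;
        osfs->os_bavail += spill->os_bavail * spill->os_bsize / osfs->os_bsize;
}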

> OSDs will access spilling devices through VFS interfaces in the kernel
> space. Therefore, the spilling device must be mountable into the
> kernel namespace. Initially, only HDD is supported, and the
> implementation can be extended to S3 and GCS using s3fs and gcsfuse
> respectively in the future.

This isn't clear to me. Once we support HDD, what incremental work
is needed to support s3fs/gcsfuse/etc? As long as they present a
normal filesystem API, they should all work the same?

This raises the question: if we're already doing this work to support
writing Lustre objects to any arbitrary filesystem via VFS and we're only
intending to support OSTs with this proposal, why not implement
an OST-only VFS OSD and handle tiering in the filesystem layer?

The kernel component would be much simpler. And we'd be able to
support a lot more complexity in user space with FUSE.

If we still wanted to handle the tiering/migration in kernel space,
then a VFS OSD would allow us to do that purely using the OSD
APIs, similar to what Andreas was suggesting.

> There’s no on-demand migration feature. If there’s no available space
> while a large chunk of data is being written, the system will return
> ENOSPC to the client. This simplifies the granting mechanism on the
> OFD significantly. Another reason for this approach is that writing
> performance is more predictable. Otherwise, the system might have to
> wait for an object to be migrated on demand, leading to much higher
> latency.

I think users would tolerate slow jobs better than write failures that
only happen because the OSD isn't smart enough to fall back to the
empty spill device.
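
Even a crude fallback would help here, i.e. only fail the write once
both tiers are actually full. Roughly (osd_write_primary() and
osd_write_spill() are made-up helpers, just to sketch the idea):

/* Hypothetical: try the primary device first and, instead of passing
 * -ENOSPC back to the client, retry the write on the spill device.
 * Slower than the primary tier, but better than failing the job. */
static int osd_write_with_spill(struct osd_object *obj,
                                const void *buf, size_t len, loff_t off)
{
        int rc;

        rc = osd_write_primary(obj, buf, len, off);
        if (rc != -ENOSPC)
                return rc;

        return osd_write_spill(obj, buf, len, off);
}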

> ## Conclusion
>
> This gives a possible solution to address the pain points of
> prevalent tiered storage using mirroring, where a
> full-flavoured policy engine has to run on dedicated clients, and

The client still has to do some manual management? i.e. you can't
write indefinitely; the client would have to periodically force a
migration to the spill device or it might hit ENOSPC? This is better
than the pain of manually managing mirrors, but it's not a complete
solution.

> deliver better performance than block-level cache with dmcache, where
> recovery time is lengthy if the cache size is huge.

This seems speculative. Could you elaborate more on why you think
this is the case?


