[lustre-devel] RFC: Spill device for Lustre OSD

Jinshan Xiong jinshan.xiong at gmail.com
Mon Nov 3 11:51:29 PST 2025



> On Nov 2, 2025, at 21:50, Andreas Dilger <adilger at ddn.com> wrote:
> 
>  On Nov 2, 2025, at 16:22, Jinshan Xiong <jinshan.xiong at gmail.com> wrote:
>> Hi folks,
>> 
>> I came up with an idea to implement tiered storage for Lustre in a new
>> way. I'm sharing it here in order to get some feedback and decide if
>> it is worth pursuing. This is still in the early stage so it's just a
>> rough idea.
>> 
>> Thanks,
>> Jinshan
> 
> Jinshan,
> thanks for sending out this proposal.  As we discussed previously,
> I think it would be better for the long-term maintenance of Lustre if
> the spill device was also accessed via the Lustre OSD API instead
> of directly via the VFS, so that each of the OSDs implemented could
> be stacked on top of another (e.g. osd-memfs spilling to osd-ldiskfs
> on NVMe or HDD).

Yeah, the major difference is that this spill device is not a fully functional OSD: we want to support s3fs and gcsfuse backends, which won’t work with OFD on their own. We can discuss this further.

I created a PR at https://review.whamcloud.com/c/fs/lustre-release/+/62171 and we can discuss over there if it is easier.

> 
> I haven't looked into the details, but the state machine for managing
> the spill device objects seems similar to FLR mirror files, and using
> the existing LOV EA layout on the OST objects would reduce the
> amount of dedicated tools that need to be developed to access such
> files.

Exporting that information to the Lustre stack would be complex, because we would then have another entity that can initiate layout changes from the OST. Please have a look at the design and we can discuss whether that approach is reasonable.

> 
> This seems similar to having the LOD layer split operations between
> a local OSD and a remote OSP, but it could split operations over
> two local OSD devices.
> 
> The other thing of interest in the other direction is the log-structured
> writes to migrated files, which might be useful for FLR mirrored or
> EC files.  This would allow replaying partial overwrites of existing
> files into the other mirrors/EC at a later time.

Yes, by remembering the partial overwrites in the hot tier, resync will be much less expensive.

> 
>> # Spilling Device Proposal for Lustre OSD
>> 
>> ## Introduction
>> 
>> A spilling device is a block device private to an OSD in Lustre. It
>> allows the OSD to migrate infrequently accessed data objects to that
>> device. The key distinction from popular tiered-storage designs is
>> that it only has a local view of the objects in a single OSD;
>> consequently, it can make quick and accurate decisions based on the
>> recent access pattern. Only data objects can be spilled into the
>> spilling device, while metadata objects, such as the OI namespace and
>> the last objid, remain in the OSD permanently. This implies that a
>> spilling device is only applicable to OST OSDs. By design, the size
>> of an OSD should be sufficient to store the entire local working set,
>> for instance, all the data for an AI training job. A typical
>> configuration would be a 1TB SSD OSD with 10TB of HDD as the spilling
>> device.
>> 
>> Generic Lustre utilities won’t be able to access spilling devices
>> directly. For example, **lfs df** will only display the capacity of
>> the OSD device, while the capacity of the spilling device can be
>> queried with specific options.
> 
> It should be possible to pass an extra OS_STATE_STATFS flag to
> return the extra capacity of the spill device in statfs()/lfs df output.
> 
>> OSDs will access spilling devices through VFS interfaces in the kernel
>> space. Therefore, the spilling device must be mountable into the
>> kernel namespace. Initially, only HDD is supported, and the
>> implementation can be extended to S3 and GCS using s3fs and gcsfuse
>> respectively in the future.
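>>
>> For illustration, reading a spilled object through the in-kernel VFS
>> interfaces would look roughly like the sketch below; error handling
>> is trimmed and the function name is hypothetical:
>>
>> ```c
>> #include <linux/err.h>
>> #include <linux/fs.h>
>>
>> /* Read "count" bytes at "pos" from a spilled object via the standard
>>  * in-kernel VFS helpers.  The path layout is described under
>>  * "migrate" below. */
>> static ssize_t spill_dev_read(const char *path, void *buf,
>> 			      size_t count, loff_t pos)
>> {
>> 	struct file *filp;
>> 	ssize_t rc;
>>
>> 	filp = filp_open(path, O_RDONLY, 0);
>> 	if (IS_ERR(filp))
>> 		return PTR_ERR(filp);
>>
>> 	rc = kernel_read(filp, buf, count, &pos);
>> 	filp_close(filp, NULL);
>> 	return rc;
>> }
>> ```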
>> 
>> ## Architecture
>> 
>> When an OST object is spilled, it leaves a zero-length stub object
>> with a special EA, called the spill EA, in the OSD. The spill EA
>> tracks the status of the object in the OSD and also stores the
>> metadata of the spilled object in the spilling device.
>> 
>> Status of the object:
>> - **MIGRATING**: The object data is being copied to the spilled object.
>> - **MIGRATED**: The object data exists in both the OSD and the
>> spilling device, and the contents are synchronized.
>> - **RELEASED**: The object data is only in the spilling device, and it
>> leaves a stub object in the OSD.
>> - **DIRTY**: The object is released, but it has been written to
>> afterward, so the OSD retains some up-to-date data.
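>>
>> As a sketch, the spill EA could be laid out as follows; the state
>> values and field names are illustrative only:
>>
>> ```c
>> #include <linux/types.h>
>>
>> /* Illustrative only: a possible on-disk layout for the spill EA. */
>> enum spill_state {
>> 	SPILL_MIGRATING	= 1,	/* copy to the spilling device in progress */
>> 	SPILL_MIGRATED	= 2,	/* both copies exist and are in sync */
>> 	SPILL_RELEASED	= 3,	/* data lives only on the spilling device */
>> 	SPILL_DIRTY	= 4,	/* released, then partially overwritten */
>> };
>>
>> struct spill_ea {
>> 	__u32	se_state;	/* enum spill_state */
>> 	__u32	se_flags;
>> 	__u64	se_size;	/* size of the spilled object */
>> 	__u64	se_oid;		/* object id, determines the spill path */
>> };
>> ```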
>> 
>> **Migrating**
>> 
>> Migration involves copying data into the spilling device. A daemon
>> process in the OSD continuously scans for migration candidates. The
>> typical migration policy is based on the last access time and size of
>> the objects. For instance, if an object hasn’t been accessed in the
>> past day and its size exceeds 10MB, the daemon will create an object
>> in the spilling device and copy its contents over.
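>>
>> A minimal sketch of the candidate check the daemon might apply; the
>> thresholds mirror the example above, and the atime field access
>> follows older kernels (newer ones use the inode_get_atime() accessor):
>>
>> ```c
>> #include <linux/fs.h>
>> #include <linux/ktime.h>
>>
>> #define SPILL_IDLE_SECS	(24 * 60 * 60)	/* untouched for a day */
>> #define SPILL_MIN_SIZE	(10 << 20)	/* larger than 10MB */
>>
>> /* Hypothetical helper: should this inode be migrated? */
>> static bool spill_is_candidate(struct inode *inode)
>> {
>> 	time64_t now = ktime_get_real_seconds();
>>
>> 	return inode->i_atime.tv_sec + SPILL_IDLE_SECS < now &&
>> 	       i_size_read(inode) > SPILL_MIN_SIZE;
>> }
>> ```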
>> 
>> There’s no on-demand migration feature. If there’s no available space
>> while a large chunk of data is being written, the system will return
>> ENOSPC to the client. This simplifies the grant mechanism on the OFD
>> significantly. Another reason for this approach is that write
>> performance is more predictable: otherwise, the system might have to
>> wait for an object to be migrated on demand, leading to much higher
>> latency.
>> 
>> **Releasing**
>> 
>> Once the data is migrated, the daemon can release the object by
>> truncating it in the OSD and setting the spill EA status to
>> **RELEASED**, atomically.
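>>
>> In OSD terms, releasing could be a single transaction along these
>> lines; the dt_* calls are from the existing OSD API (signatures from
>> memory), while the EA name is hypothetical and error handling is
>> elided:
>>
>> ```c
>> /* XATTR_NAME_SPILL is a hypothetical name for the spill EA.  ea_buf
>>  * holds the EA body with se_state already set to SPILL_RELEASED. */
>> static int spill_release(const struct lu_env *env, struct dt_device *dev,
>> 			 struct dt_object *obj, const struct lu_buf *ea_buf)
>> {
>> 	struct thandle *th = dt_trans_create(env, dev);
>>
>> 	dt_declare_punch(env, obj, 0, OBD_OBJECT_EOF, th);
>> 	dt_declare_xattr_set(env, obj, ea_buf, XATTR_NAME_SPILL, 0, th);
>> 	dt_trans_start_local(env, dev, th);
>>
>> 	dt_punch(env, obj, 0, OBD_OBJECT_EOF, th);	/* drop local data */
>> 	dt_xattr_set(env, obj, ea_buf, XATTR_NAME_SPILL, 0, th);
>>
>> 	return dt_trans_stop(env, dev, th);	/* commit both or neither */
>> }
>> ```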
>> 
>> Typically, an object is released only when the OSD’s available space
>> becomes limited. This allows us to keep as much data as possible in
>> the OSD.
>> 
>> **Restoring**
>> 
>> Restoring involves copying data back to the OSD device. It’s triggered
>> by reading from or writing to a released object and is managed by the
>> daemon, so it doesn’t interfere with the critical code path handling
>> read and write operations from the OFD.
>> 
>> When restoring, the object data is first copied to a temporary file
>> (`O_TMPFILE`). Afterward, it’s renamed to switch the inode in the OI
>> and reset the spill EA status atomically.
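>>
>> In userspace terms (the OSD would use the equivalent in-kernel
>> calls), the `O_TMPFILE` mechanism looks like this sketch; paths are
>> illustrative. Note that linkat() cannot replace an existing name,
>> which is why switching an existing stub goes through the orphan
>> directory and a rename, as described under **restore** below:
>>
>> ```c
>> #define _GNU_SOURCE
>> #include <fcntl.h>
>> #include <limits.h>
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> /* Copy a spilled object into an unnamed O_TMPFILE on the OSD
>>  * filesystem, then give it a name in one step. */
>> int restore_to_tmpfile(const char *osd_mnt, const char *new_path)
>> {
>> 	char proc[PATH_MAX];
>> 	int fd = open(osd_mnt, O_TMPFILE | O_WRONLY, 0600);
>>
>> 	if (fd < 0)
>> 		return -1;
>> 	/* ... copy the spilled object's data into fd ... */
>> 	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
>> 	if (linkat(AT_FDCWD, proc, AT_FDCWD, new_path, AT_SYMLINK_FOLLOW)) {
>> 		close(fd);
>> 		return -1;
>> 	}
>> 	return close(fd);
>> }
>> ```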
>> 
>> As mentioned earlier, the spilling device is operated through
>> standard VFS interfaces, so the OSD can directly read and write the
>> object in the spilling device. For the first few reads, a released
>> file is read directly from the spilling device. Writing is handled
>> separately and will be covered later.
>> 
>> Maintaining a synchronized state between two systems is a challenge.
>> We’ll leverage llog extensively to achieve this.
>> 
>> ## Operations Handling
>> 
>> **write**
>> 
>> If an object lacks a spill EA, it means there’s no spilled object in
>> the spilling device. In such cases, the write operation proceeds
>> normally.
>> 
>> If an object is in the **MIGRATING** or **MIGRATED** state, the write
>> goes directly to the object and the spill EA is deleted, with an llog
>> entry ensuring the spilled object is eventually removed.
>> 
>> If an object is in the **RELEASED** state, it writes the data to the
>> object in log-structured format and sets its status to **DIRTY**.
>> 
>> If an object is in the **DIRTY** state, it appends a new log entry to
>> the object. If the number of logs exceeds a predefined limit, it
>> triggers a restoration process.
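>>
>> Putting the write cases together, the dispatch might look like the
>> sketch below, reusing the spill_ea layout sketched earlier; every
>> helper and constant here is hypothetical:
>>
>> ```c
>> #include <linux/errno.h>
>>
>> /* Hypothetical helpers, shown only to make the dispatch read clearly. */
>> struct write_args;
>> int osd_write_normal(struct write_args *args);
>> int spill_ea_remove_with_llog(struct spill_ea *ea);
>> int spill_log_append(struct spill_ea *ea, struct write_args *args);
>> int spill_log_count(struct spill_ea *ea);
>> void spill_restore_async(struct spill_ea *ea);
>> #define SPILL_LOG_MAX	64	/* predefined log-entry limit */
>>
>> static int spill_write(struct spill_ea *ea, struct write_args *args)
>> {
>> 	if (ea == NULL)		/* no spill EA: plain local object */
>> 		return osd_write_normal(args);
>>
>> 	switch (ea->se_state) {
>> 	case SPILL_MIGRATING:
>> 	case SPILL_MIGRATED:
>> 		/* local copy is authoritative again; the llog entry makes
>> 		 * sure the spilled object is eventually deleted */
>> 		spill_ea_remove_with_llog(ea);
>> 		return osd_write_normal(args);
>> 	case SPILL_RELEASED:
>> 		ea->se_state = SPILL_DIRTY;
>> 		return spill_log_append(ea, args);	/* log-structured */
>> 	case SPILL_DIRTY:
>> 		if (spill_log_count(ea) > SPILL_LOG_MAX)
>> 			spill_restore_async(ea);	/* too many logs */
>> 		return spill_log_append(ea, args);
>> 	}
>> 	return -EINVAL;
>> }
>> ```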
>> 
>> **read**
>> 
>> If an object lacks a spill EA, it means there’s no spilled object in
>> the spilling device. In such cases, the read operation proceeds
>> normally.
>> 
>> If an object is in the **MIGRATING** or **MIGRATED** state, it’s read directly.
>> 
>> If an object is in the **DIRTY** state, it checks whether the read
>> range overlaps with the entries in the write log, so that the newest
>> data is served from the log (see the sketch below).
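>>
>> The overlap test against a write-log entry is a plain extent
>> intersection; a minimal sketch with hypothetical types:
>>
>> ```c
>> #include <linux/types.h>
>>
>> /* Hypothetical write-log entry: one extent written after release. */
>> struct spill_log_ent {
>> 	__u64	sle_start;	/* first byte covered */
>> 	__u64	sle_end;	/* last byte + 1 */
>> };
>>
>> /* Does the read range [start, end) overlap this log entry?  If so,
>>  * that part of the read is served from the local log rather than
>>  * from the spilled object. */
>> static bool spill_log_overlaps(const struct spill_log_ent *ent,
>> 			       __u64 start, __u64 end)
>> {
>> 	return start < ent->sle_end && ent->sle_start < end;
>> }
>> ```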
>> 
>> If an object is in the **RELEASED** state, it reads the data directly
>> from the spilled object. If the number of reads to this object exceeds
>> a predefined limit, it initiates a restoration process and sets the
>> object’s status to **MIGRATED**.
>> 
>> **truncate and unlink**
>> 
>> If an object has a spill EA and is being truncated or deleted, it
>> writes an llog entry to ensure the spilled object is eventually
>> deleted.
>> 
>> **migrate**
>> 
>> Before initiating the migration of an object, it’s crucial that the
>> spilled object be in a known state, in order to prevent remote object
>> leakage. Therefore, the spill EA must be set first, and the spilled
>> object created only after that transaction has committed. The
>> challenging aspect is that no information about the spilled object is
>> available before its creation.
>> 
>> We decided to store the object at a known path. For instance, object
>> paths like `<fsname>/OST<index>/<oid:0:2>/<oid:2:>` seem reasonable.
>> This also means that the OSD must own the spilling device entirely;
>> having foreign objects there would cause confusion.
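>>
>> A sketch of building that path; the `%.2s`/`+ 2` split implements the
>> `<oid:0:2>/<oid:2:>` scheme, and the exact formatting (for example,
>> the OST index width) is illustrative:
>>
>> ```c
>> #include <stdio.h>
>>
>> /* Build <fsname>/OST<index>/<oid:0:2>/<oid:2:>.  The first two
>>  * characters of the oid become a subdirectory so that no single
>>  * directory grows huge; oid must have at least two characters. */
>> static int spill_object_path(char *buf, size_t len, const char *fsname,
>> 			     unsigned int ost_index, const char *oid)
>> {
>> 	return snprintf(buf, len, "%s/OST%04x/%.2s/%s",
>> 			fsname, ost_index, oid, oid + 2);
>> }
>> ```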
> 
> While my preference is that we use a regular OSD for the spill storage,
> I don't see why the exclusive access to this directory tree also needs
> exclusive use to the whole block device?  As long as the directory
> itself is not being modified by other applications then it should be OK?

That’s right. I just don’t want someone to accidentally create a conflicting file, even though it’s unlikely. It’s easier to simply own the device outright.

Btw, the design leads to an implementation where a single GCS bucket is used by all OSDs in a Lustre instance.

> 
>> **restore**
>> 
>> Restoration uses a temporary file, which is essentially a file
>> created under an orphan directory. Once the copy from the spilled
>> object is complete, it also applies the write log entries found in
>> the original object. When everything is done, it renames the
>> temporary file over the corresponding OI file and appends a log entry
>> to delete the spilled object, all in a single local transaction.
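>>
>> As hedged pseudocode, the final step of a restore bundles the
>> remaining work like this; the context structure and helpers are
>> hypothetical:
>>
>> ```c
>> /* Sketch: finish a restore.  All names are hypothetical. */
>> static int spill_restore_commit(struct restore_ctx *ctx)
>> {
>> 	/* 1. replay the write-log entries on top of the copied data */
>> 	spill_log_apply(ctx->tmp_obj);
>>
>> 	/* then, in one local transaction:
>> 	 * 2. rename the temporary file over the OI file, switching the
>> 	 *    inode that the OI entry points at;
>> 	 * 3. reset the spill EA on the restored object;
>> 	 * 4. append an llog entry so the spilled object gets deleted. */
>> 	return spill_restore_trans(ctx);
>> }
>> ```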
>> 
>> ## Implementation
>> 
>> The OSD APIs are well-maintained and provide dedicated interfaces for
>> object manipulation and body operations that interact with object
>> content. The spill device configuration will be stored in the
>> configuration log, so that the OSD knows during device initialization
>> whether it has a spilling device.
>> 
>> If a spill device exists for an OSD, the object and body operations
>> will be redirected to a new set of OSD APIs. These APIs essentially
>> check for the existence of a spill EA before proceeding with any
>> operation; for purely local objects, the original OSD APIs are still
>> invoked.
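>>
>> The redirection could be a thin wrapper per operation, along these
>> lines; lu_env, dt_object and lu_buf are existing OSD types, while the
>> wrapper and its helpers are hypothetical:
>>
>> ```c
>> /* Hypothetical helpers behind the wrapper. */
>> int spill_ea_get(const struct lu_env *env, struct dt_object *obj,
>> 		 struct spill_ea *ea);
>> ssize_t osd_orig_read(const struct lu_env *env, struct dt_object *obj,
>> 		      struct lu_buf *buf, loff_t *pos);
>> ssize_t spill_dispatch_read(const struct lu_env *env,
>> 			    struct dt_object *obj, struct spill_ea *ea,
>> 			    struct lu_buf *buf, loff_t *pos);
>>
>> /* Installed in place of the original body-read op when the OSD has a
>>  * spilling device: check for a spill EA first, fall back to the
>>  * original OSD API for purely local objects. */
>> static ssize_t spill_body_read(const struct lu_env *env,
>> 			       struct dt_object *obj, struct lu_buf *buf,
>> 			       loff_t *pos)
>> {
>> 	struct spill_ea ea;
>>
>> 	if (spill_ea_get(env, obj, &ea) < 0)	/* no spill EA */
>> 		return osd_orig_read(env, obj, buf, pos);
>>
>> 	return spill_dispatch_read(env, obj, &ea, buf, pos);
>> }
>> ```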
>> 
>> A new set of Lustre utilities will be developed to display information
>> about spill devices.
>> 
>> ## Conclusion
>> 
>> This proposal tries to address the pain points of prevalent tiered
>> storage based on mirroring, where a full-featured policy engine has
>> to run on dedicated clients, and to deliver better performance than
>> block-level caching with dm-cache, where recovery time is lengthy if
>> the cache is huge.
> 
> Cheers, Andreas
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud/DDN