[lustre-devel] RFC: Spill device for Lustre OSD
Jinshan Xiong
jinshan.xiong at gmail.com
Sun Nov 2 15:22:25 PST 2025
Hi folks,
I came up with an idea to implement tiered storage for Lustre in a new
way. I'm sharing it here in order to get some feedback and decide if
it is worth pursuing. This is still at an early stage, so it's just a
rough idea.
Thanks,
Jinshan
# Spilling Device Proposal for Lustre OSD
## Introduction
A spilling device is a block device private to an OSD in Lustre. It
allows the OSD to migrate infrequently accessed data objects to this
device. The key distinction from popular tiered-storage designs is
that it only has a local view of the objects in a single OSD;
consequently, it can make quick and accurate decisions based on
recent access patterns. Only data objects can be spilled to the
spilling device, while metadata objects, such as the OI namespace and
the last objid, remain in the OSD permanently. This implies that a
spilling device is only applicable to OST OSDs. By design, the size
of an OSD should be sufficient to store the entire local working set,
for instance, all the data for an AI training job. A typical
configuration would pair a 1TB SSD OSD with a 10TB HDD spilling device.
Generic Lustre utilities won’t be able to access spilling devices
directly. For example, **lfs df** will only display the capacity of
the OSD device, while the capacity of the spilling device can be
queried with specific options.
OSDs will access spilling devices through VFS interfaces in kernel
space. Therefore, the spilling device must be mountable into the
kernel namespace. Initially, only HDDs are supported; the
implementation can later be extended to S3 and GCS using s3fs and
gcsfuse, respectively.
## Architecture
When an OST object is spilled, it leaves a zero-length stub object
with a special EA, called the spill-ea, in the OSD. The spill-ea
tracks the status of the object in the OSD and also stores the
metadata of the object's spilled copy in the spilling device.
Status of the object:
- **MIGRATING**: The object data is being copied to the spilled object.
- **MIGRATED**: The object data exists in both the OSD and the
spilling device, and the contents are synchronized.
- **RELEASED**: The object data is only in the spilling device; only
a stub object remains in the OSD.
- **DIRTY**: The object is released, but it has been written to
afterward, so the OSD retains some up-to-date data.
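For concreteness, the spill-ea could be laid out roughly as in the
sketch below. The EA name, field names, and sizes are all
hypothetical, not a committed on-disk format.

```c
#include <linux/types.h>

/* Hypothetical spill-ea layout; names and sizes are illustrative. */
#define XATTR_NAME_SPILL	"trusted.spill"

enum spill_state {
	SPILL_MIGRATING	= 0,	/* copy to the spilling device in progress */
	SPILL_MIGRATED	= 1,	/* data valid in both OSD and spilling device */
	SPILL_RELEASED	= 2,	/* data only in the spilling device */
	SPILL_DIRTY	= 3,	/* released, then partially overwritten */
};

struct spill_ea {
	__u32	se_magic;	/* layout magic for sanity checking */
	__u32	se_state;	/* enum spill_state */
	__u64	se_size;	/* size of the spilled object */
	__u64	se_oid;		/* object id; determines the spill path */
	__u32	se_gen;		/* bumped on every migrate/restore cycle */
	__u32	se_nlogs;	/* log entries appended while DIRTY */
};
```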
**Migrating**
Migration copies data into the spilling device. A daemon process in
the OSD continuously scans for migration candidates. A typical
migration policy is based on the last access time and the size of an
object. For instance, if an object hasn’t been accessed in the past
day and its size exceeds 10MB, the daemon creates an object in the
spilling device and copies the contents over.
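As a minimal sketch, the candidate test might look like the
following; the thresholds and the helper name are invented, and
`struct lu_attr` is the regular Lustre attribute structure.

```c
#include <linux/ktime.h>

/* Illustrative thresholds; in practice these would be tunables. */
#define SPILL_MIN_SIZE		(10ULL << 20)	/* 10MB */
#define SPILL_IDLE_SECONDS	(24 * 60 * 60)	/* one day */

/* Return true if an object, described by its Lustre attributes, has
 * been idle long enough and is large enough to be worth migrating. */
static bool spill_is_candidate(const struct lu_attr *attr)
{
	if (attr->la_size < SPILL_MIN_SIZE)
		return false;
	return ktime_get_real_seconds() - attr->la_atime >= SPILL_IDLE_SECONDS;
}
```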
There’s no on-demand migration feature. If no space is available
while a large chunk of data is being written, the system returns
ENOSPC to the client. This simplifies the grant mechanism in the OFD
significantly. Another reason for this approach is that write
performance remains predictable; otherwise, the system might have to
wait for an object to be migrated on demand, leading to much higher
latency.
**Releasing**
Once the data is migrated, the daemon can release the object by
atomically truncating it in the OSD and setting the spill-ea status
to **RELEASED**.
Typically, an object is released only when the OSD’s available space
becomes limited. This allows us to keep as much data as possible in
the OSD.
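Release could be expressed as a single local transaction through the
dt_object API, roughly as sketched below. Error handling is elided,
`spill_ea` refers to the hypothetical layout above, and the exact
dt_* signatures vary across Lustre versions.

```c
/* Truncate the local copy and flip the spill-ea to RELEASED in one
 * local transaction, so a crash can never lose both copies. */
static int spill_release(const struct lu_env *env, struct dt_device *dev,
			 struct dt_object *obj, struct spill_ea *ea)
{
	struct thandle *th = dt_trans_create(env, dev);
	struct lu_buf buf = { .lb_buf = ea, .lb_len = sizeof(*ea) };
	int rc;

	ea->se_state = SPILL_RELEASED;
	rc = dt_declare_punch(env, obj, 0, OBD_OBJECT_EOF, th);
	rc = rc ?: dt_declare_xattr_set(env, obj, &buf, XATTR_NAME_SPILL, 0, th);
	rc = rc ?: dt_trans_start_local(env, dev, th);
	rc = rc ?: dt_punch(env, obj, 0, OBD_OBJECT_EOF, th);
	rc = rc ?: dt_xattr_set(env, obj, &buf, XATTR_NAME_SPILL, 0, th);
	dt_trans_stop(env, dev, th);
	return rc;
}
```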
**Restoring**
Restoring involves copying data back to the OSD device. It’s triggered
by reading from or writing to a released object and is managed by the
daemon, so it doesn’t interfere with the critical code path handling
read and write operations from the OFD.
When restoring, the object data is first copied into a temporary
file (`O_TMPFILE`). Afterward, the temporary file is renamed to
switch the inode in the OI, and the spill-ea status is reset
atomically.
As mentioned earlier, the spilling device is operated through
standard VFS interfaces, so the OSD can read and write the spilled
object directly. The first few reads of a released object are served
directly from the spilling device. Writing is handled separately and
is covered later.
Maintaining a synchronized state between two systems is a challenge.
We’ll leverage llog extensively to achieve this.
## Operations Handling
**write**
If an object lacks a spill EA, it means there’s no spilled object in
the spilling device. In such cases, the write operation proceeds
normally.
If an object is in the **MIGRATING** or **MIGRATED** state, the
write goes directly to the local object and the spill EA is deleted,
with an llog record written so that the spilled object is eventually
removed.
If an object is in the **RELEASED** state, it writes the data to the
object in log-structured format and sets its status to **DIRTY**.
If an object is in the **DIRTY** state, a new log entry is appended
to the object. If the number of log entries exceeds a predefined
limit, a restoration is triggered.
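Putting the four cases together, the write path might dispatch along
these lines; every `spill_*`/`osd_*` helper and the log limit are
invented for illustration.

```c
#define SPILL_MAX_LOGS	64	/* illustrative restore trigger */

static int spill_write(const struct lu_env *env, struct dt_object *obj,
		       const struct lu_buf *buf, loff_t pos,
		       struct thandle *th)
{
	struct spill_ea *ea = spill_ea_get(obj);	/* NULL: no spilled copy */
	int rc;

	if (!ea)
		return osd_write_normal(env, obj, buf, pos, th);

	switch (ea->se_state) {
	case SPILL_MIGRATING:
	case SPILL_MIGRATED:
		/* Local data becomes authoritative again: drop the spill
		 * EA and llog the deletion of the spilled object. */
		spill_ea_invalidate(env, obj, th);
		return osd_write_normal(env, obj, buf, pos, th);
	case SPILL_RELEASED:
		ea->se_state = SPILL_DIRTY;
		fallthrough;
	case SPILL_DIRTY:
		/* Append to the log-structured area of the stub object. */
		rc = spill_log_append(env, obj, ea, buf, pos, th);
		if (ea->se_nlogs > SPILL_MAX_LOGS)
			spill_restore_queue(obj);
		return rc;
	}
	return -EINVAL;
}
```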
**read**
If an object lacks a spill EA, it means there’s no spilled object in
the spilling device. In such cases, the read operation proceeds
normally.
If an object is in the **MIGRATING** or **MIGRATED** state, it’s read directly.
If an object is in the **DIRTY** state, the read checks whether the
requested range overlaps with entries in the log.
If an object is in the **RELEASED** state, the data is read directly
from the spilled object. If the number of reads of this object
exceeds a predefined limit, a restoration is initiated and the
object’s status is set to **MIGRATED**.
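A released object can be read straight through VFS while its heat is
tracked, along the lines of the sketch below. `filp_open()` and
`kernel_read()` are standard kernel interfaces; the rest is invented.

```c
#include <linux/fs.h>
#include <linux/atomic.h>

#define SPILL_HOT_READS	4	/* illustrative restore trigger */

/* Minimal illustrative per-object state for a released object. */
struct spill_object {
	const char	*so_path;	/* well-known path in the spill fs */
	atomic_t	so_reads;	/* reads served since release */
};

static ssize_t spill_read_released(struct spill_object *so, void *data,
				   size_t count, loff_t pos)
{
	struct file *filp;
	ssize_t rc;

	filp = filp_open(so->so_path, O_RDONLY, 0);
	if (IS_ERR(filp))
		return PTR_ERR(filp);
	rc = kernel_read(filp, data, count, &pos);
	filp_close(filp, NULL);

	/* Restoration runs in the daemon, off the OFD read path. */
	if (rc > 0 && atomic_inc_return(&so->so_reads) == SPILL_HOT_READS)
		spill_restore_queue(so);	/* hypothetical */
	return rc;
}
```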
**truncate and unlink**
If an object with a spill EA is truncated or deleted, an llog entry
is written to ensure the spilled object is eventually deleted.
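The llog record for this deferred deletion needs little more than
the identity of the spilled object. A hypothetical layout, using the
standard llog record header and tail:

```c
/* Hypothetical llog record asking the daemon to delete the spilled
 * copy of an object once the truncate or unlink has committed. */
struct spill_unlink_rec {
	struct llog_rec_hdr	sur_hdr;
	__u64			sur_oid;	/* spilled object id */
	__u32			sur_gen;	/* migrate/restore generation */
	__u32			sur_padding;
	struct llog_rec_tail	sur_tail;
};
```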
**migrate**
Before initiating the migration of an object, it’s crucial to put
the spilled object into a known state in order to prevent remote
object leakage. Therefore, the spill EA must be set first, and the
spilled object created only after that transaction has committed.
The challenging aspect is that no information about the spilled
object is available before its creation.
We decided to store the object at a well-known path. For instance,
object paths like `<fsname>/OST<index>/<oid:0:2>/<oid:2:>` seem
reasonable. This also means that the OSD must own the spilling
device entirely; foreign objects would cause confusion.
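Building the well-known path then amounts to splitting the oid so
that its first two hex digits form a fan-out directory; the exact
format string below is illustrative.

```c
#include <linux/kernel.h>

/* Render "<fsname>/OST<index>/<oid:0:2>/<oid:2:>" into @buf. */
static int spill_object_path(char *buf, size_t len, const char *fsname,
			     __u32 ost_index, __u64 oid)
{
	char hex[17];

	snprintf(hex, sizeof(hex), "%016llx", oid);
	return snprintf(buf, len, "%s/OST%04x/%.2s/%s",
			fsname, ost_index, hex, hex + 2);
}
```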
**restore**
Restoration uses a temporary file, which is essentially a file
created under an orphan directory. Once the copy from the spilled
object is complete, the write log entries accumulated in the
original object are applied as well. When everything is done, the
temporary file is renamed over the corresponding OI file and a log
entry is appended to delete the spilled object, all in one local
transaction.
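An outline of that flow under the stated assumptions; every
`spill_*` helper is hypothetical and error handling is elided for
brevity.

```c
#include <linux/fs.h>

static int spill_restore(const struct lu_env *env, struct spill_object *so,
			 const char *orphan_dir)
{
	struct file *src, *tmp;
	int rc;

	/* 1. Anonymous temporary file under the orphan directory. */
	tmp = filp_open(orphan_dir, O_TMPFILE | O_RDWR, 0600);

	/* 2. Copy the spilled data back through VFS. */
	src = filp_open(so->so_path, O_RDONLY, 0);
	spill_copy_file(src, tmp);	/* kernel_read()/kernel_write() loop */
	filp_close(src, NULL);

	/* 3. Replay the log-structured writes accumulated while DIRTY. */
	spill_log_replay(env, so, tmp);

	/* 4. One local transaction: rename the temporary file into place,
	 *    switch the OI entry to the new inode, reset the spill-ea, and
	 *    append an llog record to delete the spilled object. */
	rc = spill_restore_commit(env, so, tmp);
	filp_close(tmp, NULL);
	return rc;
}
```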
## Implementation
The OSD APIs are well maintained and provide dedicated entry points
for object manipulation and body operations that interact with
object content.
The spill device configuration will be stored in the configuration
log so that the OSD knows, during device initialization, whether it
has a spilling device.
If a spill device exists for an OSD, the object and body operations
will be redirected to a new set of OSD APIs. These APIs essentially
check for the existence of a spill EA before proceeding with any
operation. For purely local objects, the original OSD APIs are still
invoked.
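One way to implement the redirection is a thin wrapper around the
existing body operations, for example as below. The hook signature
follows recent `dt_body_operations`; the `spill_*` names are
invented.

```c
static const struct dt_body_operations *orig_body_ops;

/* Consult the spill EA first; purely local objects fall through to
 * the original OSD implementation. */
static ssize_t spill_dbo_read(const struct lu_env *env, struct dt_object *dt,
			      struct lu_buf *buf, loff_t *pos)
{
	struct spill_ea *ea = spill_ea_get(dt);	/* NULL: no spilled copy */

	if (!ea)
		return orig_body_ops->dbo_read(env, dt, buf, pos);
	return spill_read(env, dt, ea, buf, pos);
}

static const struct dt_body_operations spill_body_ops = {
	.dbo_read	= spill_dbo_read,
	/* .dbo_write, .dbo_punch, ... wrapped the same way */
};
```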
A new set of Lustre utilities will be developed to display information
about spill devices.
## Conclusion
This proposal attempts to address the pain points of prevalent
tiered storage based on mirroring, where a full-flavoured policy
engine has to run on dedicated clients, while delivering better
performance than block-level caching with dm-cache, whose recovery
time is lengthy when the cache is large.