<html class="apple-mail-supports-explicit-dark-mode"><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div dir="ltr"></div><div dir="ltr"><br></div><div dir="ltr"><br><blockquote type="cite">On Nov 2, 2025, at 21:50, Andreas Dilger <adilger@ddn.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr">

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

On Nov 2, 2025, at 16:22, Jinshan Xiong <jinshan.xiong@gmail.com> wrote:

<div>

<blockquote type="cite">

<div>

<div>Hi folks,<br>

<br>

I came up with an idea to implement tiered storage for Lustre in a new<br>

way. I'm sharing it here in order to get some feedback and decide if<br>

it is worth pursuing. This is still in the early stage so it's just a<br>

rough idea.<br>

<br>

Thanks,<br>

Jinshan<br>

</div>

</div>

</blockquote>

<div><br>

</div>

Jinshan,</div>

<div>thanks for sending out this proposal.  As we discussed previously,</div>

<div>I think it would be better for the long-term maintenance of Lustre if</div>

<div>the spill device was also accessed via the Lustre OSD API instead</div>

<div>of directly via the VFS, so that each of the OSDs implemented could</div>

<div>be stacked on top of another (e.g. osd-memfs spilling to osd-ldiskfs</div>

<div>on NVMe or HDD).</div></div></blockquote><div><br></div><div>Yeah, the major difference is that the spill device is not a fully functional OSD: we plan to support s3fs and gcsfuse, and those cannot work with OFD on their own. We can discuss this further.</div><div><br></div><div>I created a PR at <a href="https://review.whamcloud.com/c/fs/lustre-release/+/62171" style="font-family: 'Times New Roman'; -webkit-text-stroke-color: rgb(0, 0, 238);">https://review.whamcloud.com/c/fs/lustre-release/+/62171</a> and we can discuss it there if that is easier.</div><br><blockquote type="cite"><div dir="ltr">

<div><br>

</div>

<div>I haven't looked into the details, but the state machine for managing</div>

<div>the spill device objects seems similar to FLR mirror files, and using</div>

<div>the existing LOV EA layout on the OST objects would reduce the</div>

<div>amount of dedicated tools that need to be developed to access such</div>

<div>files.</div></div></blockquote><div><br></div><div>Exporting that information to the Lustre stack would be complex, because it introduces another entity that can initiate layout changes from the OST. Please have a look at the design and we can discuss whether that approach is reasonable.</div><br><blockquote type="cite"><div dir="ltr">

<div><br>

</div>

<div>This seems similar to having the LOD layer split operations between</div>

<div>a local OSD and a remote OSP, but it could split operations over</div>

<div>two local OSD devices.</div>

<div><br>

</div>

<div>The other thing of interest in the other direction is the log-structured</div>

<div>writes to migrated files, which might be useful for FLR mirrored or</div>

<div>EC files.  This would allow replaying partial overwrites of existing</div>

<div>files into the other mirrors/EC at a later time.</div></div></blockquote><div><br></div><div>Yes, remembering the writes in the hot tier would make resync much less expensive.</div><br><blockquote type="cite"><div dir="ltr">

<div><br>

<blockquote type="cite">

<div>

<div># Spilling Device Proposal for Lustre OSD<br>

<br>

## Introduction<br>

<br>

A spilling device is a block device private to an OSD in Lustre. It<br>

allows an OSD to migrate infrequently accessed data objects to this<br>

device. The key distinction from popular tiered storage is that it<br>

only has a local view of the objects in a local OSD. Consequently, it<br>

should make quick and accurate decisions based on the recent access<br>

pattern. Only data objects can be spilled into the spilling device,<br>

while metadata objects, such as OI namespace and last objid, remain in<br>

the OSD permanently. This implies that the spilling device for OSD is<br>

only applicable to OST OSDs. By design, the size of an OSD should be<br>

sufficient to store the entire local working set, for instance, all the<br>

data for an AI training job. A typical configuration would involve a<br>

1TB SSD OSD with 10TB of HDD as a spilling device.<br>

<br>

Generic Lustre utilities won’t be able to directly access spilling<br>

devices. For example, **lfs df** will only display the capacity of the<br>

OSD device, while the capacity of the spilling device can be accessed<br>

using specific options.<br>

</div>

</div>

</blockquote>

<div><br>

</div>

It should be possible to pass an extra OS_STATE_STATFS flag to</div>

<div>return the extra capacity of the spill device in statfs()/lfs df output.</div>

<div><br>

<blockquote type="cite">

<div>

<div>OSDs will access spilling devices through VFS interfaces in the kernel<br>

space. Therefore, the spilling device must be mountable into the<br>

kernel namespace. Initially, only HDD is supported, and the<br>

implementation can be extended to S3 and GCS using s3fs and gcsfuse<br>

respectively in the future.<br>

<br>

## Architecture<br>

<br>

When an OST object is spilled, it leaves a zero-length stub object<br>

with a special EA, called spill-ea, in the OSD. The spill-ea is used<br>

to track the status of the object in the OSD and also store the<br>

metadata of the spilled object in the spilling device.<br>

<br>

Status of the object:<br>

- **MIGRATING**: The object data is being copied to the spilled object.<br>

- **MIGRATED**: The object data exists in both the OSD and the<br>

spilling device, and the contents are synchronized.<br>

- **RELEASED**: The object data is only in the spilling device, and it<br>

leaves a stub object in the OSD.<br>

- **DIRTY**: The object is released, but it has been written to<br>

afterward, so the OSD retains some up-to-date data.<br>

<br>

**Migrating**<br>

<br>

Migration involves copying data into the spilling device. A daemon<br>

process in the OSD continuously scans candidates for migration. The<br>

typical policy for migration is based on the last access time and size<br>

of the objects. For instance, if an object hasn’t been accessed in the<br>

past day and its size exceeds 10MB, the daemon will create an object in the<br>

spilling device and copy its contents over.<br>
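<br>
A minimal sketch of the example policy above, assuming the daemon sees a<br>
last-access time and a size for each object. `ObjectInfo`,<br>
`is_migration_candidate` and `scan` are hypothetical names used only for<br>
illustration, and 10MB is read here as 10 MiB.<br>
<pre>
```python
from dataclasses import dataclass

DAY_SECONDS = 24 * 60 * 60          # "hasn't been accessed in the past day"
MIN_SIZE_BYTES = 10 * 1024 * 1024   # "size exceeds 10MB", read as 10 MiB (assumption)


@dataclass
class ObjectInfo:
    """Hypothetical per-object record seen by the scanning daemon."""
    oid: int
    atime: float   # last access time, seconds since the epoch
    size: int      # object size in bytes


def is_migration_candidate(obj, now):
    """Return True if the object is both cold and large enough to migrate."""
    idle = now - obj.atime
    return idle >= DAY_SECONDS and obj.size > MIN_SIZE_BYTES


def scan(objects, now):
    """Return the oids the daemon would pick for migration in one pass."""
    return [o.oid for o in objects if is_migration_candidate(o, now)]
```
</pre>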

<br>

There’s no on-demand migration feature. If there’s no available space<br>

while a large chunk of data is being written, the system will return<br>

ENOSPC to the client. This simplifies the granting mechanism on the<br>

OFD significantly. Another reason for this approach is that writing<br>

performance is more predictable. Otherwise, the system might have to<br>

wait for an object to be migrated on demand, leading to much higher<br>

latency.<br>

<br>

**Releasing**<br>

<br>

Once the data is migrated, the daemon can release the object by<br>

truncating it in the OSD and setting the spill-ea status to<br>

**RELEASED**, atomically.<br>

<br>

Typically, an object is released only when the OSD’s available space<br>

becomes limited. This allows us to keep as much data as possible in<br>

the OSD.<br>

<br>

**Restoring**<br>

<br>

Restoring involves copying data back to the OSD device. It’s triggered<br>

by reading from or writing to a released object and is managed by the<br>

daemon, so it doesn’t interfere with the critical code path handling<br>

read and write operations from the OFD.<br>

<br>

When restoring, the object data is first copied to a temporary file<br>

(`O_TMPFILE`). Afterward, it’s renamed to switch the inode in OI and<br>

reset the spill-ea status atomically.<br>

<br>

As mentioned earlier, the spilling device is operated with standard<br>

VFS interfaces. Therefore, the OSD can directly read and write the<br>

object in the spilling device. We tend to read a released file<br>

directly in the first few reads. Writing is handled separately and<br>

will be covered later.<br>

<br>

Maintaining a synchronized state between two systems is a challenge.<br>

We’ll leverage llog extensively to achieve this.<br>

<br>

## Operations Handling<br>

<br>

**write**<br>

<br>

If an object lacks a spill EA, it means there’s no spilled object in<br>

the spilling device. In such cases, the write operation proceeds<br>

normally.<br>

<br>

If an object is in the **MIGRATING** or **MIGRATED** state, it writes<br>

directly to the object and deletes the spill EA by writing an llog entry.<br>

<br>

If an object is in the **RELEASED** state, it writes the data to the<br>

object in log-structured format and sets its status to **DIRTY**.<br>

<br>

If an object is in the **DIRTY** state, it appends a new log entry to<br>

the object. If the number of logs exceeds a predefined limit, it<br>

triggers a restoration process.<br>
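<br>
The write rules above can be read as a small transition function. This is<br>
only a sketch: `SpillState`, `handle_write` and `MAX_LOG_ENTRIES` are<br>
hypothetical names, the limit value is arbitrary, and the returned action<br>
strings stand in for the real OSD and llog work.<br>
<pre>
```python
from enum import Enum


class SpillState(Enum):
    NONE = "no spill EA"
    MIGRATING = "MIGRATING"
    MIGRATED = "MIGRATED"
    RELEASED = "RELEASED"
    DIRTY = "DIRTY"


MAX_LOG_ENTRIES = 4  # hypothetical value for the "predefined limit"


def handle_write(state, log_entries):
    """Return (new_state, new_log_entries, action) for a single write."""
    if state is SpillState.NONE:
        # No spilled object exists, so the write proceeds normally.
        return state, log_entries, "write in place"
    if state in (SpillState.MIGRATING, SpillState.MIGRATED):
        # The spilled copy becomes stale: write locally and drop the spill EA.
        return SpillState.NONE, 0, "write in place, delete spill EA"
    # RELEASED or DIRTY: the data is appended in log-structured format.
    log_entries += 1
    if log_entries > MAX_LOG_ENTRIES:
        return SpillState.DIRTY, log_entries, "append log entry, trigger restore"
    return SpillState.DIRTY, log_entries, "append log entry"
```
</pre>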

<br>

**read**<br>

<br>

If an object lacks a spill EA, it means there’s no spilled object in<br>

the spilling device. In such cases, the read operation proceeds<br>

normally.<br>

<br>

If an object is in the **MIGRATING** or **MIGRATED** state, it’s read directly.<br>

<br>

If an object is in the **DIRTY** state, it checks if the read range<br>

overlaps with the entries in the log.<br>

<br>

If an object is in the **RELEASED** state, it reads the data directly<br>

from the spilled object. If the number of reads to this object exceeds<br>

a predefined limit, it initiates a restoration process and sets the<br>

object’s status to **MIGRATED**.<br>
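<br>
For the **DIRTY** case, the overlap check could look like the sketch<br>
below. It assumes each log entry is an (offset, length) pair over a<br>
half-open byte range; `overlapping_entries` is a hypothetical helper, and<br>
what the OSD does with the matching entries is not specified here.<br>
<pre>
```python
def overlapping_entries(log, offset, length):
    """Return the (offset, length) log entries that intersect the read range.

    Ranges are half-open: an entry covers [o, o + n) and the read covers
    [offset, offset + length). Two ranges intersect when each one starts
    before the other ends.
    """
    end = offset + length
    return [(o, n) for (o, n) in log if o + n > offset and end > o]
```
</pre>
A read that matches no entry can be served entirely from the spilled object.<br>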

<br>

**truncate and unlink**<br>

<br>

If an object has a spill EA and is being truncated or deleted, it<br>

writes an llog entry to ensure the spilled object is eventually<br>

deleted.<br>

<br>

**migrate**<br>

<br>

Before initiating the migration of an object, it’s crucial to put the<br>

spilled object in a known state in order to prevent remote object<br>

leakage. Therefore, it’s essential to set the spill EA first and then<br>

create the spilled object after the transaction is committed. The<br>

challenging aspect is that it's not possible to have any information<br>

about the spilled object before its creation.<br>

<br>

We decided to store the object in a known path. For instance, object<br>

paths like `<fsname>/OST<index>/<oid:0:2>/<oid:2:>` seem reasonable.<br>

This also means that the OSD must own the spilling device entirely.<br>

Having foreign objects would cause confusion.<br>
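<br>
As a sketch of how such a path could be derived before the spilled object<br>
exists: the code below reads `oid:0:2` and `oid:2:` as string slices of<br>
the object id. The hex rendering, the 16-digit zero padding and the<br>
4-digit hex OST index are assumptions made for illustration, and<br>
`spill_path` is a hypothetical helper.<br>
<pre>
```python
def spill_path(fsname, ost_index, oid):
    """Build the relative path of a spilled object from values known up front."""
    # Assumption: the oid is rendered as 16 lowercase hex digits so that both
    # slices below are always non-empty.
    oid_str = f"{oid:016x}"
    # Assumption: the OST index uses 4 hex digits, as in Lustre target names.
    return f"{fsname}/OST{ost_index:04x}/{oid_str[0:2]}/{oid_str[2:]}"
```
</pre>
Because the path depends only on the filesystem name, the OST index and<br>
the object id, it can be recorded in the spill EA before the create is<br>
issued. Zero padding only keeps the slices well defined; which digits give<br>
a good directory fan-out is a separate choice.<br>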

</div>

</div>

</blockquote>

<div><br>

</div>

<div>While my preference is that we use a regular OSD for the spill storage,</div>

I don't see why exclusive access to this directory tree also requires</div>

<div>exclusive use of the whole block device?  As long as the directory</div>

<div>itself is not being modified by other applications then it should be OK?</div></div></blockquote><div><br></div>That’s right. I just don’t want someone to accidentally create a conflicting file, even though that’s unlikely. It’s easier for the OSD to own the device outright.<div><br></div><div>Btw, with this design, a single GCS bucket would be shared by all OSDs in a Lustre instance.<br><div><br><blockquote type="cite"><div dir="ltr">

<div><br>

<blockquote type="cite">

<div>

<div>**restore**<br>

<br>

Restoration will use a temporary file, which is essentially a file<br>

created under an orphan directory. Once the copy from spilled object<br>

is complete, it will also apply the write log entries in the original<br>

object. Once everything is done, it will rename the temporary file and<br>

the corresponding OI file, and append a log entry to delete the<br>

spilled object, all in a local transaction.<br>

<br>

## Implementation<br>

<br>

OSD APIs are well-maintained and provide dedicated APIs for object<br>

manipulation and body operations to interact with object content.<br>

Spill device configuration will be stored in the configuration log so<br>

that the OSD will know whether it has a spilling device during device<br>

initialization.<br>

<br>

If a spill device exists for an OSD, the object and body operations<br>

will be redirected to a new set of OSD APIs. These APIs essentially<br>

check for the existence of a spill EA before proceeding with any<br>

operations. If local objects are used, the original OSD APIs will<br>

still be invoked.<br>

<br>

A new set of Lustre utilities will be developed to display information<br>

about spill devices.<br>

<br>

## Conclusion<br>

<br>

This proposal tries to address the pain points of prevalent tiered<br>

storage based on mirroring, where a full-featured policy engine has to<br>

run on dedicated clients. It should also deliver better performance than<br>

a block-level cache with dm-cache, where recovery time is lengthy if the<br>

cache size is huge.<br>

</div>

</div>

</blockquote>

</div>

<br>

<div>

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">

<div>Cheers, Andreas</div>

<div>—</div>

<div>Andreas Dilger</div>

<div>Lustre Principal Architect</div>

<div>Whamcloud/DDN</div>

</div>

<br class="Apple-interchange-newline">

</div>


</div>

<br>

</div></blockquote></div></div></body></html>