[lustre-devel] RFC: Spill device for Lustre OSD

Jinshan Xiong jinshanx at google.com
Tue Nov 4 15:54:32 PST 2025


On Tue, Nov 4, 2025 at 3:48 PM Andreas Dilger <adilger at dilger.ca> wrote:

>
> Timothy Day <timday at amazon.com> wrote:
> >>>> I haven’t seen any mention of failover yet in this conversation
> >>>> (may have missed it), but if the device is truly local, then in
> >>>> failed-over configurations the data is inaccessible.  If it’s *not*
> >>>> local, why not just make the device part of the OST or an
> >>>> independent OST?
> >>>
> >>> It won't be local. Actually, this is designed for the cloud.
> >>
> >> I don't understand how 'local' is being used. Cloud or not, all of
> >> the Lustre clients, servers, and backend storage services will be
> >> co-located in the same data center. I think Patrick is asking whether
> >> the spill device will be physically attached to the OSS server, or
> >> provided over something like a SAN? Either way, presenting this
> >> device as an independent OST brings back the pain of manually
> >> managing data placement from the client - which this design is
> >> trying to avoid.
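
To make that placement pain concrete: with the spill device as a separate
OST in its own pool, every placement decision gets pushed onto the client,
roughly like this (pool names are made up):

    # each tier is its own OST pool; users/tools must pick one up front
    lfs setstripe -p flash /mnt/lustre/scratch      # hot directory
    lfs setstripe -p capacity /mnt/lustre/archive   # cold directory
    # moving data between tiers later is an explicit client-driven copy
    lfs migrate -p capacity /mnt/lustre/scratch/old_results
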
> >>
> >>> We already have tiered storage based on mirroring; however, that
> >>> still requires clients to move data and a file-system-level scanner
> >>> to decide which files move to the cold tier. It's cumbersome to
> >>> maintain those clients.
> >>
> >> Agree, it's not ideal.
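
For reference, the FLR-based flow we have today is driven entirely from a
client, roughly as follows (pool name and mirror ID are illustrative):

    # add a mirror of the file on the cold pool
    lfs mirror extend -N -p cold /mnt/lustre/file
    # bring the copies back in sync after the file is modified
    lfs mirror resync /mnt/lustre/file
    # once the file has gone cold, drop the hot copy
    lfs mirror split --mirror-id 1 -d /mnt/lustre/file
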
>
> Regardless of how the spill device is implemented, there will need to be
> some scanning of the front OSD device to find/manage objects to mirror
> and release.  This could be done directly on the OST with something like
> DDN's lipe_find3 utility, or older scanners like lester, zester, e2scan,
> etc. that scan the local ldiskfs block device directly.
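
A client-side sketch of such a scan-and-mirror pass, using lfs find as a
stand-in for those server-side scanners (the age threshold and pool name
are illustrative, and --mirror-count needs an FLR-aware lfs):

    # mirror files that are idle for 30+ days and have only one copy
    lfs find /mnt/lustre -type f -atime +30 --mirror-count 1 |
    while read -r f; do
        lfs mirror extend -N -p cold "$f"
    done
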
>
> If the overhead of a local Lustre mount on the OSS is problematic, that
> seems like something which could/should be fixed?  The local mounts are
> already "non-recoverable" so that they do not get an entry in last_rcvd
> and their absence does not cause any recovery issues.
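
For concreteness, the local mount here is just a regular client mount on
the OSS node itself (MGS nid and fsname are made up):

    mount -t lustre mgsnode@tcp:/testfs /mnt/lustre_local
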
>
> The main issue we've seen with local mountpoints is that they can
> confuse HA and prevent the Lustre modules from unloading if they are
> not taken into account during cleanup.
>

You're right. That's actually why we didn't do it in the first place: if
an OSS crashes, it will definitely lead to a recovery timeout and client
eviction.
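
On the cleanup point, the ordering that trips up HA is roughly this
(mount points are assumed):

    # on the OSS, during failover or shutdown
    umount /mnt/lustre_local    # the local client mount must go first
    umount /mnt/ost0            # then the OST target(s)
    lustre_rmmod                # only now can the modules unload
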


>
> Cheers, Andreas