[lustre-devel] RFC: Spill device for Lustre OSD

Jinshan Xiong jinshanx at google.com
Mon Nov 3 16:33:43 PST 2025


On Mon, Nov 3, 2025 at 4:22 PM Oleg Drokin via lustre-devel <
lustre-devel at lists.lustre.org> wrote:

> On Mon, 2025-11-03 at 16:04 -0800, Jinshan Xiong wrote:
> >
> >
> > > On Nov 3, 2025, at 15:14, Oleg Drokin <green at whamcloud.com> wrote:
> > >
> > > On Mon, 2025-11-03 at 21:59 +0000, Day, Timothy via lustre-devel
> > > wrote:
> > > >
> > > >
> > > > This raises the question: if we're already doing this work to
> > > > support writing Lustre objects to any arbitrary filesystem via VFS,
> > > > and we're only intending to support OSTs with this proposal, why
> > > > not implement an OST-only VFS OSD and handle tiering in the
> > > > filesystem layer?
> > >
> > > The problem with pure VFS is that it does not actually provide us
> > > with what we want. So the OSD talks to the underlying FS via VFS +
> > > some more stuff (we do have the hidden mount for ldiskfs, after all).
> > > The "more stuff" is things like expanded transaction boundaries
> > > beyond what POSIX requires, so we can update more than one thing.
> > > If the Linux VFS provided all these abilities, I suspect we would
> > > not need to know much about the underlying disk fs.
> > >
> > > But currently it's just a way to add OSTs, not to move objects
> > > laterally from one OST to another, and hence this proposal, I
> > > imagine: where OSTs would grow "warts" for less-wanted data.
> > >
> > > I am not sure it's a much better idea than the existing HSM
> > > capabilities, which would allow you to have "offline" objects that
> > > are pulled back in when used but are otherwise visible in the
> > > metadata only. The underlying capabilities are pretty rich,
> > > especially if we also take into account the eventual WBC stuff.
> >
> > The major problem with the current HSM is that it has to have dedicated
> > clients to move data. Also, scanning the entire Lustre file system
>
> This (dedicated client) is an implementation detail. It could be
> improved in many ways, and the effort spent on that would bring great
> benefit to everyone.
>

Almost all designs require some upfront implementation work. We (as the
Lustre team) considered running clients on OST nodes, but cloud users are
sensitive about their data being exposed elsewhere.

Can you list a few improvements that come to mind?


>
> > takes a very long time, so it resorts to databases in order to make
> > correct decisions about which files should be released. By then, the
> > two systems will be out of sync. That makes it practically unusable.
>
> This again is an implementation detail, not even hardcoded anywhere.
> How do you plan for the OST to know what stuff is not used without
> resorting to some database or scan? Now take this method and make it
> report "upstream" where currently HSM implementations resort to
> databases or scans.
>

The assumption is that OST sizes are relatively small, up to 100TB. Also,
scanning local devices in kernel space is much faster. So yeah, there is
no database in the way.

I guess users won't have 1PB OSTs, will they?
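
To make the selection policy concrete: a scan pass would just compare each
object's atime against a cutoff. A rough userspace analogue of that scan
(the real one would walk the OSD object index in kernel space; the
threshold, the printf, and the directory argument are made up for
illustration):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define COLD_THRESHOLD (7 * 24 * 3600)  /* e.g. one week, tunable */

static time_t scan_start;

/* Flag any regular file whose atime is older than the cutoff; a real
 * implementation would queue the object for migration to the spill
 * device instead of printing it. */
static int check_object(const char *path, const struct stat *st,
                        int type, struct FTW *ftw)
{
        if (type == FTW_F && scan_start - st->st_atime > COLD_THRESHOLD)
                printf("cold object: %s\n", path);
        return 0;
}

int main(int argc, char **argv)
{
        scan_start = time(NULL);
        /* point it at the OST object directory */
        return nftw(argc > 1 ? argv[1] : ".", check_object, 64, FTW_PHYS);
}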


>
> Rereading your proposal, I see that this particular detail is not
> covered and it's just assumed that "infrequently accessed data" would
> be somehow known.
>

I should have mentioned that in the migration section. Also, we need to
slightly update the OST read path to use a local transaction to update an
object's access time (atime) if it is older than a predefined threshold,
for example, 10 minutes.
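
A minimal sketch of that read-path change, assuming the 10-minute
threshold above (userspace stand-ins only: utimensat() plays the role of
the local transaction, and maybe_update_atime() is a made-up name):

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

#define ATIME_THRESHOLD (10 * 60)       /* 10 minutes, per the proposal */

/* Called on the read path: persist a fresh atime only when the stored
 * one is stale, so hot objects never pay the extra write. */
static int maybe_update_atime(const char *path)
{
        struct stat st;
        time_t now = time(NULL);

        if (stat(path, &st) < 0)
                return -1;
        if (now - st.st_atime <= ATIME_THRESHOLD)
                return 0;               /* recent enough, skip the write */

        /* stand-in for the local transaction that persists the new atime */
        struct timespec ts[2] = {
                { .tv_sec = now },              /* new atime */
                { .tv_nsec = UTIME_OMIT },      /* leave mtime untouched */
        };
        return utimensat(AT_FDCWD, path, ts, 0);
}

int main(int argc, char **argv)
{
        return argc > 1 ? maybe_update_atime(argv[1]) : 0;
}

That way only reads that find a stale atime trigger a write, which keeps
the scan's view fresh enough without penalizing hot objects.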


>
> >
> > >
> > > If the argument is "but OSTs know best what stuff is used" (which I
> > > am not sure I buy; after all, before you can use something off OSTs
> > > you need to open a file, I would hope), even then OSTs could just
> > > signal a list of "inactive objects" that a higher-level system would
> > > then take care of by relocating them somewhere more sensible and
> > > changing the layout to indicate those objects now live elsewhere.
> > >
> > > The plus here is that you don't need to attach this "wart" to every
> > > OST and configure it everywhere, but rather have a central location
> > > that is centrally managed.
> > >
>