[lustre-devel] RFC: Spill device for Lustre OSD

Jinshan Xiong jinshanx at google.com
Tue Nov 4 09:47:36 PST 2025


On Mon, Nov 3, 2025 at 5:58 PM Oleg Drokin <green at whamcloud.com> wrote:

> On Mon, 2025-11-03 at 16:33 -0800, Jinshan Xiong wrote:
> > > > >
> > > > > I am not sure it's a much better idea than the already existing
> > > > > HSM capabilities we have that would allow you to have "offline"
> > > > > objects that would be pulled back in when used, but are otherwise
> > > > > just visible in the metadata only.
> > > > > The underlying capabilities are pretty rich, esp. if we also take
> > > > > into account the eventual WBC stuff.
> > > >
> > > > The major problem of current HSM is that it has to have dedicated
> > > > clients to move data. Also, scanning the entire Lustre file system
> > >
> > > This (dedicated client) is an implementation detail. It could be
> > > improved in many ways and the effort spent on this would bring
> > > great benefit to everyone?
> >
> >
> > Almost all designs assume some upfront implementation. We (as the
> > Lustre team) considered running clients on OST nodes, but cloud users
> > are sensitive about their data being exposed elsewhere.
> >
> > Can you list a few improvements that come to mind?
>
> Running clients between server nodes is probably one of the most obvious
> choices indeed. Considering the data already resides on those nodes, I
> am not sure I understand the concerns about "exposing" data that's
> already on those nodes. If customers are so sensitive, we support data
> encryption.
> We could also do some direct server-server migration of some sort where
> OSTs exchange data without bringing up real clients and doing copies
> from userspace. That might be desirable for other reasons for future
> functionality (e.g. various caching things people have been long
> envisioning).


This is going too far ;-)


>


> > >
> > > > takes a very long time, so it resorts to databases in order to
> > > > make correct decisions about which files should be released. By
> > > > that time, the two systems will be out of sync. That makes it
> > > > practically unusable.
> > >
> > > This again is an implementation detail, not even hardcoded anywhere.
> > > How do you plan for the OST to know what stuff is not used without
> > > resorting to some database or scan? Now take this method and make it
> > > report "upstream", where current HSM implementations resort to
> > > databases or scans.
> > >
> >
> > The assumption is that OST sizes are relatively small, up to 100TB.
> > Also, scanning local devices in kernel space is much faster. So yeah
> > there is no database in the way.
>
> I am not sure why? In general small OSTs are a relatively rare thing
> because to reach large FS sizes you would need too many of them, space
> balancing becomes a chore and so on. So relatively few do it for some
> fringe reasons (e.g. Google). The majority of people prefer large OSTs.
>
> Also nothing stops you from doing a per-OST scan (when you do have
> small OSTs) and then kicking the results up to the acting agent to do
> something about it, or the other way around: the HSM engine can ask
> OSTs one by one (picking less busy ones, or ones that have the least
> free space, or some other factor). And there's absolutely no need to
> wait to query all OSTs; you can get results from one and work on the
> data from it while the other OSTs are still thinking (or not,
> there's absolutely no requirement to get full filesystem data before
> making any decisions).
>
> > I guess users won't have 1PB OSTs, will they?
>
> There probably are already? NASA has a known 0.5P OST configuration:
>
> https://www.nas.nasa.gov/hecc/support/kb/lustre-progressive-file-layout-(pfl)-with-ssd-and-hdd-pools_680.html#:~:text=The%20available%20SSD%20space%20in%20each%20filesystem,decimal%20(far%20right)%20labels%20of%20each%20OST
> .
>
> > > Rereading your proposal, I see that this particular detail is not
> > > covered and it's just assumed that "infrequently accessed data" would
> > > be somehow known.
> >
> > I should have mentioned that in the migration section. Also, we need
> > to slightly update the OST read path to use a local transaction to
> > update an object's access time (atime) if it's older than a predefined
> > threshold, for example, 10 minutes.
>
> This is going to be fragile in the face of varying clock times on
> different clients potentially not synced with the servers.
> Also in the face of -o noatime.
>

It doesn't use client timestamps. Also, it won't be folded into the read
itself, because read doesn't initiate a transaction.

It can simply use the OSS local time and start a local transaction to
update the atime.
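
Roughly something like the sketch below. The names (spill_object,
spill_tx_start/commit, spill_maybe_refresh_atime) are made-up
placeholders rather than the real OSD API; it is only meant to show the
control flow I have in mind: compare the stored atime against the
OSS-local clock and start a local transaction only when the object is
older than the threshold, so a continuously read object pays for at most
one atime update per threshold window.

#include <time.h>

#define ATIME_REFRESH_THRESHOLD  (10 * 60)  /* 10 minutes, as proposed */

/* Placeholder for the per-object state the OSS keeps around. */
struct spill_object {
        time_t so_atime;        /* last recorded access time (server clock) */
};

/* Stubs standing in for "start/commit a local OSD transaction". */
static int spill_tx_start(void)  { return 0; }
static int spill_tx_commit(void) { return 0; }

/* Called from the OST read path; no client timestamp is ever consulted. */
static int spill_maybe_refresh_atime(struct spill_object *obj)
{
        time_t now = time(NULL);        /* OSS local clock */

        if (now - obj->so_atime < ATIME_REFRESH_THRESHOLD)
                return 0;               /* fresh enough, no transaction at all */

        if (spill_tx_start() != 0)
                return -1;

        obj->so_atime = now;            /* refresh the cached atime */
        return spill_tx_commit();       /* purely local commit */
}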


>
> But yes, I guess it's one way to get this "on the cheap", and the other
> trouble I foresee is that you are going to have a biased set: only
> recently touched objects (so with fresh atime) will show up, unless you
> plan to retain a database and update it from such a transaction flow,
> which certainly is possible, but I am not sure how practical that is vs
> some sort of a scan.
>

I don't see that as an issue. It's going to update the atime in memory.
And yes, if the OSS crashes, the scanner may choose the wrong file to
migrate, but that should be rare and I don't think it would become a
severe issue.
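
To illustrate (again with made-up names, and the same placeholder
spill_object as in the earlier sketch), the scanner side only needs a
conservative check against a much larger idle cutoff, so an atime lost
across an OSS restart at worst makes an object look idle a bit earlier
than it really is:

#include <stddef.h>
#include <time.h>

#define SPILL_IDLE_CUTOFF  (7 * 24 * 3600)      /* e.g. "idle for a week" */

struct spill_object {
        time_t so_atime;        /* may be stale right after an OSS restart */
};

/* Collect up to 'max' migration candidates from a local object list. */
static size_t spill_pick_candidates(struct spill_object *objs, size_t nobjs,
                                    struct spill_object **out, size_t max)
{
        time_t now = time(NULL);
        size_t picked = 0;

        for (size_t i = 0; i < nobjs && picked < max; i++) {
                /* A lost atime errs in one direction only: the object may
                 * look idle earlier than it really is, and a wrongly
                 * migrated object just gets pulled back on its next read. */
                if (now - objs[i].so_atime >= SPILL_IDLE_CUTOFF)
                        out[picked++] = &objs[i];
        }
        return picked;
}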


>
> > > > > If the argument is "but OSTs know best what stuff is used"
> > > > > (which I am not sure I buy; after all, before you could use
> > > > > something off OSTs you would need to open a file, I would hope),
> > > > > even then OSTs could just signal a list of "inactive objects"
> > > > > that a higher-level system would then take care of by relocating
> > > > > them somewhere more sensible and changing the layout to indicate
> > > > > those objects now live elsewhere.
> > > > >
> > > > > The plus here is you don't need to attach this "wart" to every
> > > > > OST and configure it everywhere and such, but rather have a
> > > > > central location that is centrally managed.
> > > > >
> > >