[lustre-devel] RFC: Spill device for Lustre OSD

Oleg Drokin green at whamcloud.com
Mon Nov 3 17:58:51 PST 2025


On Mon, 2025-11-03 at 16:33 -0800, Jinshan Xiong wrote:
> > > > 
> > > > I am not sure it's a much better idea than the already existing
> > > > HSM capabilities we have that would allow you to have "offline"
> > > > objects that would be pulled back in when used, but are
> > > > otherwise just visible in the metadata only.
> > > > The underlying capabilities are pretty rich, esp. if we also
> > > > take into account the eventual WBC stuff.
> > > 
> > > The major problem of the current HSM is that it has to have
> > > dedicated clients to move data. Also, scanning the entire Lustre
> > > file system
> > 
> > This (dedicated client) is an implementation detail. It could be
> > improved in many ways, and the effort spent on this would bring
> > great benefit to everyone?
> 
> 
> Almost all designs assume some upfront implementation. We (as the
> Lustre team) considered running clients on OST nodes, but cloud users
> are sensitive about their data being exposed elsewhere.
> 
> Can you list a few improvements that come to mind?

Clients on the server nodes are probably one of the most obvious
choices, indeed. Considering the data already resides on those nodes,
I am not sure I understand the concern about "exposing" data that is
already there. If customers are that sensitive, we do support data
encryption.
We could also do some direct server-to-server migration of some sort,
where OSTs exchange data without bringing up real clients and doing
copies from userspace. That might also be desirable for future
functionality (e.g. the various caching schemes people have long been
envisioning).
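
For context, the user-visible side of the existing HSM machinery is
already quite simple; a rough illustration (the file path is just an
example, and a copytool still has to service the requests somewhere):

  # Rough illustration of the existing user-side HSM cycle via lfs(1).
  # The file path below is just an example.
  import subprocess

  def hsm(action, path):
      # lfs hsm_archive / hsm_release / hsm_restore / hsm_state are
      # the standard Lustre HSM commands.
      subprocess.run(["lfs", "hsm_" + action, path], check=True)

  f = "/mnt/lustre/project/old_results.dat"
  hsm("archive", f)   # copy the data out to the archive tier
  hsm("release", f)   # drop the OST objects; metadata stays visible
  hsm("state", f)     # prints the released/archived flags
  # A later read (or an explicit "restore") pulls the data back in.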
 
> > 
> > > takes a very long time, so it resorts to databases in order to
> > > make correct decisions about which files should be released. By
> > > that time, the two systems will be out of sync. That makes it
> > > practically unusable.
> > 
> > This again is an implementation detail, not even hardcoded
> > anywhere. How do you plan for the OST to know what stuff is not
> > used without resorting to some database or scan? Now take this
> > method and make it report "upstream" where HSM implementations
> > currently resort to databases or scans.
> > 
> 
> The assumption is that OST sizes are relatively small, up to 100TB.
> Also, scanning local devices in kernel space is much faster. So
> yeah, there is no database in the way.

I am not sure why you assume that. In general, small OSTs are a
relatively rare thing: to reach large FS sizes you would need too many
of them, space balancing becomes a chore, and so on. So relatively few
sites do it, and for fairly fringe reasons (e.g. Google); the majority
of people prefer large OSTs.

Also, nothing stops you from doing a per-OST scan (when you do have
small OSTs) and then kicking the results up to the acting agent to do
something about them. Or the other way around: the HSM engine can ask
OSTs one by one, picking the less busy ones, the ones with the least
free space, or whatever other factor makes sense. And there is
absolutely no need to wait until every OST has been queried; you can
get results from one and work on its data while the other OSTs are
still thinking (or not; there is absolutely no requirement to have
full filesystem data before making any decisions).
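
Entirely hypothetical interfaces, but the flow I have in mind is
roughly this:

  # Hypothetical sketch of the per-OST flow described above; the
  # ost_list_inactive() and release_object() interfaces are made up
  # purely for illustration.
  from datetime import timedelta

  def ost_list_inactive(ost_index, idle_for):
      """Ask a single OST for objects idle longer than 'idle_for'."""
      raise NotImplementedError  # would be an OST-side scan/report

  def release_object(fid):
      """Relocate/release one object and update the file layout."""
      raise NotImplementedError  # would be driven by the HSM engine

  def sweep(ost_indices, idle_for=timedelta(days=30)):
      # Order the OSTs however makes sense: least busy first, least
      # free space first, etc.
      for idx in ost_indices:
          for fid in ost_list_inactive(idx, idle_for):
              # Act on each OST's results as soon as they arrive; no
              # need to wait for the full filesystem picture.
              release_object(fid)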

> I guess users won't have 1PB OSTs, will they?

There probably are already. NASA has a known 0.5 PB OST configuration:
https://www.nas.nasa.gov/hecc/support/kb/lustre-progressive-file-layout-(pfl)-with-ssd-and-hdd-pools_680.html#:~:text=The%20available%20SSD%20space%20in%20each%20filesystem,decimal%20(far%20right)%20labels%20of%20each%20OST

> > Rereading your proposal, I see that this particular detail is not
> > covered; it is just assumed that "infrequently accessed data" would
> > somehow be known.
> 
> I should have mentioned that in the migration section. Also, we need
> to slightly update the OST read path to use a local transaction to
> update an object's access time (atime) if it is older than a
> predefined threshold, for example, 10 minutes.

This is going to be fragile in the face of varying clock times on
different clients that are potentially not synced with the servers,
and also in the face of -o noatime.

But yes, I guess it's one way to get this "on the cheap". The other
trouble I foresee is that you are going to end up with a biased set:
you only ever see recently touched objects (the ones with a fresh
atime), unless you plan to retain a database and update it from that
transaction flow. That is certainly possible, but I am not sure how
practical it is versus some sort of a scan.
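
If I read the proposal right, the read-path check is essentially the
sketch below (names made up purely for illustration); anything that is
never read again simply never shows up in that stream:

  # Sketch of the proposed "update atime on read if stale" check and
  # why it only ever surfaces recently touched objects. All names are
  # made up for illustration.
  import time

  ATIME_THRESHOLD = 600  # seconds; the 10 minutes from the proposal

  def record_access(obj):
      pass  # placeholder for whatever consumes the access stream

  def on_ost_read(obj, now=None):
      now = time.time() if now is None else now
      if now - obj["atime"] > ATIME_THRESHOLD:
          obj["atime"] = now     # would be a local OST transaction
          record_access(obj)     # e.g. feed a retained database
      # An object that is never read again generates no record at
      # all, so finding the truly cold set still needs a scan or a
      # retained database.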

> > > > If the argument is "but OSTs know best what stuff is used"
> > > > (which I am not sure I buy; after all, before you could use
> > > > something off OSTs you need to open a file, I would hope), even
> > > > then OSTs could just signal a list of "inactive objects" that a
> > > > higher level system would then take care of by relocating them
> > > > somewhere more sensible and changing the layout to indicate
> > > > that those objects now live elsewhere.
> > > > 
> > > > The plus here is that you don't need to attach this "wart" to
> > > > every OST and configure it everywhere and such, but rather have
> > > > a central location that is centrally managed.
> > > > 
> > 


