<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Mon, Nov 3, 2025 at 5:58 PM Oleg Drokin <<a href="mailto:green@whamcloud.com">green@whamcloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, 2025-11-03 at 16:33 -0800, Jinshan Xiong wrote:<br>

> > > > <br>

> > > > I am not sure it's a much better idea than the already existing<br>

> > > > HSM<br>

> > > > capabilities we have that would allow you to have "offline"<br>

> > > > objects<br>

> > > > that would be pulled back in when used, but are otherwise just<br>

> > > > visible<br>

> > > > in the metadata only.<br>

> > > > The underlying capabilities are pretty rich esp. if we also<br>

> > > > take<br>

> > > > into<br>

> > > > account the eventual WBC stuff.<br>

> > > <br>

> > > The major problem of current HSM is that it has to have dedicated<br>

> > > clients to move data. Also, scanning the entire Lustre file<br>

> > > system<br>

> > <br>

> > This (dedicated client) is an implementation detail. It could be<br>

> > improved in many ways and the effort spent on this would bring<br>

> > great<br>

> > benefit to everyone?<br>

> <br>

> <br>

> Almost all designs assume some upfront implementation. We (as the<br>

> Lustre team) considered running clients on OST nodes, but cloud users<br>

> are sensitive about their data being exposed elsewhere.<br>

> <br>

> Can you list a few improvements that come to mind?<br>

<br>

Clients between server nodes is probably one of the most obvious<br>

choices indeed. Considering the data already resides on those nodes, I<br>

am not sure I understand the concerns about "exposing" data that's<br>

already on those nodes. If customers are so sensitive, we support data<br>

encryption.<br>

We could also do some direct server-server migration of some sort where<br>

OSTs exchange data without bringing up real clients and doing copies<br>

from userspace. That might be desirable for other reasons for future<br>

functionality (e.g. various caching things people have been long<br>

envisioning)</blockquote><div><br></div><div>This is going too far ;-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> </blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

 <br>

> > <br>

> > > takes a very long time, so it resorts to databases in order to make<br>

> > > correct decisions about which file should be released. By that<br>

> > > time,<br>

> > > the two systems will be out of sync. That makes it practically<br>

> > > unusable.<br>

> > <br>

> > This again is an implementation detail, not even hardcoded<br>

> > anywhere.<br>

> > How do you plan for the OST to know what stuff is not used<br>

> > without<br>

> > resorting to some database or scan? Now take this method and make<br>

> > it<br>

> > report "upstream" where currently HSM implementations resort to<br>

> > databases or scans.<br>

> > <br>

> <br>

> The assumption is that OST sizes are relatively small, up to 100TB.<br>

> Also, scanning local devices in kernel space is much faster. So yeah,<br>

> there is no database in the way.<br>

<br>

I am not sure why? In general small OSTs are a relatively rare thing<br>

because to reach large FS sizes you would need too many of them, space<br>

balancing becomes a chore and so on. So relatively few do it for some<br>

fringe reasons (e.g. Google). The majority of people prefer large OSTs.<br>

<br>

Also nothing stops you from doing a per-OST scan (when you do have<br>

small OSTs) and then kicking the results up to the acting agent to do<br>

something about it (or the other way around, the HSM engine can ask<br>

OSTs one by one (picking less busy ones or ones that have the least<br>

free space, or some other factor)). And there's absolutely no need to<br>

wait to query all OSTs, you can get results from one and work on<br>

the data from it while the other OSTs are still thinking (or not,<br>

there's absolutely no requirement to get full filesystem data before<br>

making any decisions).<br>

<br>

> I guess users won't have 1PB OSTs, will they?<br>

<br>

There probably are already? NASA has a known 0.5P OST configuration:<br>

<a href="https://www.nas.nasa.gov/hecc/support/kb/lustre-progressive-file-layout-(pfl)-with-ssd-and-hdd-pools_680.html#:~:text=The%20available%20SSD%20space%20in%20each%20filesystem,decimal%20(far%20right)%20labels%20of%20each%20OST" rel="noreferrer" target="_blank">https://www.nas.nasa.gov/hecc/support/kb/lustre-progressive-file-layout-(pfl)-with-ssd-and-hdd-pools_680.html#:~:text=The%20available%20SSD%20space%20in%20each%20filesystem,decimal%20(far%20right)%20labels%20of%20each%20OST</a><br>

.<br>

<br>

> > Rereading your proposal, I see that this particular detail is not<br>

> > covered and it's just assumed that "infrequently accessed data"<br>

> > would<br>

> > be somehow known.<br>

> <br>

> I should have mentioned that in the migration section. Also, we need<br>

> to slightly update the OST read to use a local transaction to update<br>

> an object's access time (atime) if it's older than a predefined<br>

> threshold, for example, 10 minutes. <br>

<br>

This is going to be fragile in the face of varying clock times on<br>

different clients potentially not synced with the servers.<br>

Also in the face of -o noatime.<br></blockquote><div><br></div><div>It doesn't use client timestamps. Also, it won't be part of read because read doesn't initiate a transaction.</div><div><br></div><div>It can simply use the OSS local time and start a local transaction to update the atime. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

But yes, I guess it's one way to get this "on the cheap", and the other<br>

trouble I foresee is you are going to have a biased set. Only recently<br>

touched objects (so with fresh atime), unless you plan to retain a<br>

database and update it from such transaction flow, which certainly is<br>

possible, but I am not sure how practical vs some sort of a scan.<br></blockquote><div><br></div><div>I don't see that as an issue. It's going to update the atime in memory. And yeah, if the OSS crashes, the scanner may choose a wrong file to migrate, but this should be rare and I don't think it would become a severe issue.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> > > > If the argument is "but OSTs know best what stuff is used"<br>

> > > > (which I<br>

> > > > am<br>

> > > > not sure I buy, after all before you could use something off<br>

> > > > OSTs<br>

> > > > you<br>

> > > > need to open a file I would hope) even then OSTs could just<br>

> > > > signal<br>

> > > > a<br>

> > > > list of "inactive objects" that then a higher level system<br>

> > > > would<br>

> > > > take<br>

> > > > care of by relocating somewhere more sensical and changing the<br>

> > > > layout<br>

> > > > to indicate those objects now live elsewhere.<br>

> > > > <br>

> > > > The plus here is you don't need to attach this "Wart" to every<br>

> > > > OST<br>

> > > > and<br>

> > > > configure it everywhere and such, but rather have a central<br>

> > > > location<br>

> > > > that is centrally managed.<br>

> > > > <br>

> > <br>

> > _______________________________________________<br>

> > lustre-devel mailing list<br>

> > <a href="mailto:lustre-devel@lists.lustre.org" target="_blank">lustre-devel@lists.lustre.org</a><br>

> > <a href="http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org</a><br>

> > <br>

<br>

</blockquote></div></div>
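<div dir="ltr"><div><br></div><div>To make the atime idea above concrete, here is a rough sketch of what I have in mind. It is only an illustration in Python, not Lustre code: OstObject, touch_on_read and cold_objects are made-up names, the 600 second threshold is the "10 minutes" example from this thread, and every timestamp is assumed to come from the OSS local clock, never from a client.</div>
<pre>
```python
from dataclasses import dataclass

ATIME_THRESHOLD_S = 600  # the "10 minutes" example from the thread


@dataclass
class OstObject:
    """Hypothetical stand-in for an OST object; not a Lustre structure."""
    name: str
    atime: float  # seconds, always taken from the server (OSS) clock


def touch_on_read(obj, server_now, threshold=ATIME_THRESHOLD_S):
    """Refresh atime only when it is stale by more than `threshold`.

    Returns True when an update would be issued (assumed to be one
    local transaction in a real system), False when atime is left alone.
    """
    if server_now - obj.atime > threshold:
        obj.atime = server_now
        return True
    return False


def cold_objects(objects, server_now, idle_s):
    """Names of objects idle for at least `idle_s` seconds, coldest first."""
    idle = [o for o in objects if server_now - o.atime >= idle_s]
    return [o.name for o in sorted(idle, key=lambda o: o.atime)]
```
</pre>
<div>With this scheme the recorded atime can lag the real last access by up to the threshold, so the scanner's idle cutoff (idle_s here) would presumably need to be much larger than the threshold for the lag not to matter.</div></div>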