[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming

Oleg Drokin green at whamcloud.com
Sun Jan 19 13:20:31 PST 2025


On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> On Sat, 18 Jan 2025, Oleg Drokin wrote:
> > On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > > We need to demonstrate a process for, and commitment to, moving
> > > away
> > > from the dual-tree model.  We need patches to those parts of
> > > Lustre
> > > that are upstream to land in upstream first (mostly).
> > 
> > I think this is not very realistic.
> > A large chunk (100%?) of users not only do not run the latest kernel
> > release, they don't run the latest LTS either.
> Are you referring to lustre users or all Linux users?

Lustre users (a very minuscule part of Linux users run Lustre anyway).

> If the latter, then xfs etc face the same problem and seem to manage.
> If lustre users: they can't because the latest kernel doesn't include
> lustre.  Maybe you are seeing a chicken-and-egg problem?

They are related. A regular person can run xfs at home (being a single
node local filesystem and all; I know cluster xfs exists but I am not
sure Linux actually supports it), and when Fedora/Red Hat decided
xfs is the default install fs in some cases, adoption
understandably shot up too.
Now once we get into networked filesystems, even commonplace things
like NFS are barely run by anybody. Lustre? Would still remain in large
datacenters that are not exactly known for running on the bleeding edge
(RHEL7 is still strong, apparently).

> 
> > Of course some automatic infrastructure could be built up to make
> > it
> > somewhat better, but it does not remove the problem of "nobody
> > would
> > run this mainline tree", I am afraid.
> 
> We've never had a credible lustre in a mainline tree, so we cannot
> know
> how many people would use it.  Importantly developers would use it
> because that is where development would happen.

Well, I agree if we could have actual development happen there it would
change everything. But the problem here is this decision is outside the
hands of developers as I just wrote to Tim in the other email.
Too many management types are anti-open development and Lustre is kinda
niche, so only a handful of companies control the actual
developers.

> > It does not help that there are what 3? 4? trees, not "dual-tree" by
> > any
> > stretch of imagination.
> > 
> > There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> > keeps their fork still I think (though it's mostly backports?).
> > There
> > are likely others I am less exposed to.
> 
> "dual-tree" maybe isn't the best way of describing what was wrong
> with
> the previous approach.  "upstream-first" is one way of describing how
> it
> should be run, though that needs to be understood correctly.

Yes. I agree. And this is exactly what kernel maintainers demand (or
would if they don't yet). But in the land of "we must have this
differentiating feature in order to sell our product over the
competitor offering" it does not fly.
In the past, when a lot of the market was controlled by various
government labs that mandated open source, it was easier. But with the
foray into various commercial deployments, and esp. with huge demand
from AI installations that don't care about anything besides "I want
things to work the best right now", this factor has mostly evaporated
and we are descending into "closed hell" with increasing speed, I am
afraid.

> > Sure, only one of those trees is considered "community Lustre", but
> > if
> > it will detach too much from what majority of developers really
> > runs
> > and gets paid to do - the "community Lustre" contributions probably
> > would diminish greatly, I am afraid.
> > 
> > The past situation of "oh, this new enterprise linux comes with a
> > community lustre version, so the first step to get something usable
> > is
> > to rip it entirely off and then apply the new good version" is not
> > exactly desirable either I am afraid.
> 
> Obviously that is not what we want, and clearly people aren't tempted
> to
> do that with any of FS so why do you think it will happen with
> lustre?

Already happened (several times in different ways)
I think I have a faint memory of other kernel components having a
similar problem.

I guess MOFED is the most current example.
"rip out the in-kernel IB stuff, replace with our greatest shiny"

> The "new good version" will simply be a few patches on top of
> whatever
> kernel you have.  Hopefully the distributor of that kernel will have
> applied those already if any of their customers care about the
> filesystem.

That is the ideal, anyway, but seems somewhat hard to reach.

> > > We need to quickly reach a point where a lustre release is:
> > > 
> > >  - a verbatim copy of relevant files from a chosen upstream
> > > release,
> > >    or just a dependency on that kernel source.
> > >  - a bunch of extra files that might one day go upstream: server
> > > code
> > >    and LNet protocol code
> > >  - a *few* patches to integrate that code
> > >  - some number of patches which have since gone upstream -
> > > bugfixes
> > > etc.
> > >  - libcfs which contains a compat layer for older kernels.
> > >  - user-space code, documentation, test scripts, etc for which
> > > there
> > >    is no expectation of upstreaming to linux kernel.
> > 
> > All these sound like an awful lot of dedicated developer-hours.
> > 
> > > Maybe the question for LSF is : what is a sufficient
> > > demonstration of
> > > commitment?
> > > 
> > > The big question for us is : how are we going to transition our
> > > infrastructure to this model?
> > 
> > and who would pay for it.
> Obviously there will be a cost to transition.  It seems someone is
> already willing to pay some of that because patches have been landing
> which are only there to make the ultimate transition easier.  Why do
> you
> think that will stop?

it won't, but at the current rate I am not even sure conversion is
happening faster than breakage ;)

> 
> Once the transition completes there will still be process
> difficulties,
> but there are plenty of process difficulties now (gerrit: how do I
> hate thee, let me count the ways...) but people seem to simply
> include
> that in the cost of doing business.

It's been a while since I did patch reviews by email, but I think
gerrit is much more user-friendly (if you have internet, anyway).

> > This in the end was the downfall of the previous attempt. There
> > never
> > was any serious funding behind the effort so it became an
> > afterthought
> > for most.
> 
> I don't think funding is the big problem.  I think it is "buy-in".
> Individual people in positions of power - such as yourself - need to
> see
> the value and be willing to change the way they work.  If you,
> personally, are not willing to change then there is no point even
> talking about this any more.

While I see value, people in positions of actual power (e.g. those who
pay my salary and get to dictate priorities) don't agree that it is a
good idea to change the development process to a fully open model.


