[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
NeilBrown
neilb at suse.de
Sat Feb 1 14:58:36 PST 2025
On Thu, 30 Jan 2025, Alexey Lyahkov wrote:
>
>
> > On 29 Jan 2025, at 22:00, Day, Timothy <timday at amazon.com> wrote:
> >
> >>>>> That's why we'll
> >>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
> >>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
> >>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
> >>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
> >>>>> allows users to optionally run the latest and greatest Lustre code.
> >>>>
> >>>> And oops: both code bases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building out-of-tree Lustre.
> >>>> For some examples of the MOFED hacks that solve the same problem, see o2iblnd:
> >>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
> >>>> #undef CONFIG_INFINIBAND_VIRT_DMA
> >>>> #endif
> >>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.06 while Lustre was in staging/.
> >>>
> >>> I think we should be able to validate that Lustre still builds as an
> >>> out-of-tree module by re-using a lot of the testing we already
> >>> do today in Jenkins/Maloo.
> >>
> >> Yes, we do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
> >> Or who will pay for it? That is the cost of moving to the kernel.
> >
> > I suppose I disagree that this testing requires many extra
> > resources. This is just validating the same things we validate
> > today (i.e. that Lustre is functional on RHEL kernels). Only the
> > build process looks different.
> >
> Ah. So you don’t expect to do any performance testing?
> Performance testing needs a 20-node cluster with an IB HDR network (400G) and E1000 with NVMe drives at a minimum.
> Otherwise the servers / network will be the bottleneck.
> And a week or so of load to be sure no regressions exist. Some problems can be found only with 48h of continuous load.
> And that is minimal performance testing.
> I’m not even talking about scale testing with 100+ client nodes.
> Do you think we need to drop it? If not, who will provide the HW for such testing?
We at SUSE have a performance team. We do some testing on upstream
because that guards our future, but (I believe) we do most testing on
our own releases, to ensure we don't regress and to find problems before
our customers do. The key observation is that Linus' upstream kernel
doesn't have to be perfect. There are regressions all the time. That
is why we have the -stable trees. That is how distros like SUSE and
Red Hat make money.
Yes, we want to be doing correctness and performance testing on
mainline, but we don't need to ensure we block every regression. We only
need to eventually find regressions and then fix them. Hopefully we
find and fix them before a regression gets to any of our customers (though
in reality our customers find quite a few of our regressions).
>
>
> >>> All we'd need to do is kick off test/build
> >>> sessions once the merge window closes. Based on the MOFED
> >>> example you gave, it seems like this is solvable.
> >>
> >> Sure, all of this can be solved. But what is the cost of this, and the cost of supporting these changes?
> >> And the next question: who will pay for it? Who will provide the HW for the extra testing?
> >> So the second face of “no cost for kernel API changes” is the burden of backporting these changes and the extra testing.
> >
> > I don't think the backporting will be more burdensome
> > than porting Lustre to new kernels. And we don't have to
> > urgently backport each upstream release to older kernels.
> NeilBrown says we need to move all development to mainline. That
> means the kernel upstream will be the same as the ‘master’ branch is now.
> So each change needs to be backported to older kernels to stay in sync
> with the servers and be ready for a Lustre release.
> Otherwise we will have a ton of changes that need to be backported for
> each Lustre release.
> I see no difference from porting to upstream, except that this porting
> from mainline to old kernels must be handled ASAP to avoid delaying a
> Lustre release, while porting to mainstream may be delayed, as it is not
> critical for customers.
Porting to upstream doesn't work. The motivation isn't strong enough:
people leave it, then forget it, you get too much divergence, and
it becomes harder, so people do it even less. People have tried. People
have failed.
Backporting from upstream to an older kernel isn't that hard. I do a
lot of it and, with the right tools, it is mostly easy. One of the
biggest difficulties is when we try to backport only a selection of
patches, because we might miss an important dependency. Sometimes it is
worth it to avoid churn; sometimes it is best to apply everything
relevant. I assume that for the selection of kernels that whamcloud (or
whoever) wants to support, they would backport everything that could
apply. I think that would be largely mechanical.
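To make "largely mechanical" concrete, here is a toy, self-contained sketch of
that flow. Everything here is a hypothetical illustration: the fs/lustre path,
branch names, and file contents are invented for the demo, not taken from any
real tree.

```shell
# Toy demo of a mechanical backport: build a throwaway repo, land a
# "lustre" fix on the main line, then cherry-pick it onto a backport
# branch. `-x` records the upstream SHA in the backported commit
# message, so auditing what was picked also stays mechanical.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Pretend fs/lustre/ is where the client lives upstream (hypothetical).
mkdir -p fs/lustre
echo "int x;" > fs/lustre/file.c
git add -A && git commit -qm "initial"
git branch lustre-backports        # the backport tree forks here

# An upstream fix touching the lustre paths...
echo "int y;" >> fs/lustre/file.c
git add -A && git commit -qm "lustre: fix something"
fix=$(git rev-parse HEAD)

# ...is picked onto the backport branch.
git checkout -q lustre-backports
git cherry-pick -x "$fix" >/dev/null
git log -1 --format=%B
```

In a real setup the loop would be driven by something like
`git rev-list --reverse backport..upstream -- <lustre paths>`, with a human
notified whenever a pick does not apply cleanly.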
Maybe it would be good for me to paint a more detailed picture of what I
imagine would happen - assuming we do take the path of landing all of
lustre, both client and server, upstream.
- we would change the kernel code in lustre-release so that it was
exactly what we plan to submit upstream.
- we submit it and once accepted we have identical code in upstream
linux and lustre-release
- we fork lustre-release to a new package called (e.g.) lustre-tools and
remove all kernel code, leaving just the utils, documentation, and
test code. The kinode.c kernel module that is in lustre/tests/kernel/
would, I think, need to go upstream with the rest of the kernel code.
lustre-tools would be easily accessible and buildable by anyone who
wants to test lustre
- we fork lustre-release to another new package, lustre-backports,
and remove all non-kernel code from there. We configure it to build
out-of-tree modules with names like "backport-lustre" and "backport-lnet",
and provide modprobe.conf files that alias the standard names to
these. That should allow over-riding the distributed modules (if
any) when people choose to use the backports.
- upstream commits which touch lustre or lnet are automatically added to
lustre-backports, and someone is notified to help when they don't apply
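As an aside, the modprobe.conf files in the lustre-backports step might look
something like the sketch below. The alias lines follow the scheme described
above; the commented-out "install" variant is my assumption about what would
be needed if a distro also ships in-tree modules and the plain alias does not
win name resolution.

```
# /etc/modprobe.d/lustre-backports.conf  (sketch, names from the
# proposal above; exact mechanism is an assumption)
# Map the standard module names onto the backport builds.
alias lustre backport-lustre
alias lnet   backport-lnet
#
# If the distro also ships an in-tree lustre.ko and the alias alone
# does not take precedence, an "install" override is a heavier hammer:
# install lustre /sbin/modprobe backport-lustre
# install lnet   /sbin/modprobe backport-lnet
```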
With this:
Anyone who wants to test or use the lustre included with a particular
kernel can do so with only the lustre-tools package. Anyone who
wants to use the latest lustre code with an older kernel can build and
use lustre-backports.
There are probably rough edges with this, but I suspect they can be filed
down.
thanks,
NeilBrown