[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming

Alexey Lyahkov alexey.lyashkov at gmail.com
Sat Feb 1 23:33:34 PST 2025



> On 2 Feb 2025, at 01:58, NeilBrown <neilb at suse.de> wrote:
> 
> On Thu, 30 Jan 2025, Alexey Lyahkov wrote:
>> 
>> 
>>> On 29 Jan 2025, at 22:00, Day, Timothy <timday at amazon.com> wrote:
>>> 
>>>>>>> That's why we'll
>>>>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>>>>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
>>>>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>>>>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>>>>>>> allows users to optionally run the latest and greatest Lustre code.
>>>>>> 
>>>>>> And oops: both code bases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building out-of-tree Lustre.
>>>>>> Some examples of the MOFED hacks that solve the same problem can be seen in o2iblnd:
>>>>>>>>> 
>>>>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>>>>>> #undef CONFIG_INFINIBAND_VIRT_DMA
>>>>>> #endif
>>>>>>>>> 
>>>>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.06 while Lustre was in staging/.
>>>>> 
>>>>> I think we should be able to validate that Lustre still builds as an
>>>>> out-of-tree module by re-using a lot of the testing we already
>>>>> do today in Jenkins/Maloo.
>>>> 
>>>> Yes, we do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
>>>> Or who will pay for it? It’s the cost of moving to the kernel.
>>> 
>>> I suppose I disagree that this testing requires many extra
>>> resources. This just validates the same things we validate
>>> today (i.e. that Lustre is functional on RHEL kernels). But the
>>> build process looks different.
>>> 
>> Ah. So you don’t expect to do any performance testing?
>> Performance testing needs a 20-node cluster with an IB HDR network (400G) and E1000 with NVMe drives at a minimum.
>> Otherwise the servers / network will be the bottleneck.
>> And a week or so of load to be sure no regressions exist. Some problems can only be found with 48h of continuous load.
>> And that is the minimal performance testing.
>> I’m not even talking about scale testing with 100+ client nodes.
>> Do you think we need to drop it? If not, who will provide the HW for such testing?
> 
> We at SUSE have a performance team.  We do some testing on upstream
> because that guards our future, but (I believe) we do most testing on
> our own releases to ensure we don't regress and to find problems before
> our customers.  The key observation is that Linus' upstream kernel
> doesn't have to be perfect.  There are regressions all the time.  That
> is why we have the -stable trees.  That is how distros like SUSE and
> Redhat make money.
> 
Thanks, I know. RedHat and SuSe make an unusable kernel upstream and take money to make it better.


> Yes, we want to be doing correctness and performance testing on
> mainline, but we don't need to ensure we block any regressions.  We only
> need to eventually find regressions and then fix them.  Hopefully we
> find and fix before the regression gets to any of our customers (though
> in reality our customers find quite a few of our regressions).

So:
1) The Lustre perf team gets more work: running performance testing on mainline and the LTS branches to find regressions.
2) Lustre developers need to look into these regressions and fix them from time to time.
3) Since that is not blocking for quality, the quality of the Lustre client in the Linux kernel will be poor, with bugs and performance issues. That blocks using it from the kernel mainline.
4) And the Lustre client/server separation introduces problems for development.

(1) means extra people need to be hired and extra HW needs to be involved.
(2,4) mean extra people need to be hired.
(3) means any real customer still needs to use Lustre from a repository outside of the Linux kernel, so the Lustre code in the Linux kernel is not used in real production.

What is the benefit of spending this money? Just avoiding a non-priority task, porting to a new kernel, which needs to be done once every several years
(when a new SuSe or RedHat release is created).

It looks like nobody understands that porting Lustre to a new kernel is not hard work and not a priority. The single problem I remember from the beginning is just changing page to folio (see the sketch below).
Lustre has its own MM stack inside, since it was designed to work on different platforms: MacOS, Windows (yes, a private branch had a native Windows client, but that was for Lustre 2.1 as I remember), FreeBSD... A large portion of that compatibility and portability code has already been removed as the first step of moving toward kernel upstream.
But again, Lustre has its own page tree and its own paging daemon; on the server side it has its own ‘VFS’ stack on the MD servers and its own data path with preallocated page buffers on the OST. It has its own network stack (LNet) with its own routing / forwarding and protocol conversion (the LNet router).
But the problems related to cache coherency between clients and distributed transactions for MD are much harder.
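
To make the page-to-folio point concrete: the conversion is largely a mechanical API substitution. Below is a minimal, hypothetical sketch (not actual Lustre code) of what such a conversion typically looks like:

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Hypothetical helper, before the conversion: operates on struct page. */
    static void demo_mark_page_ready(struct page *page)
    {
            lock_page(page);
            SetPageUptodate(page);
            unlock_page(page);
    }

    /* The same helper converted to folios: the logic is unchanged,
     * only the type and the accessor names differ. */
    static void demo_mark_folio_ready(struct folio *folio)
    {
            folio_lock(folio);
            folio_mark_uptodate(folio);
            folio_unlock(folio);
    }

The renaming is the easy part; the harder issues listed above (cache coherency between clients, distributed MD transactions) are untouched by it.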


>> 
>> 
>>>>> All we'd need to do is kick off test/build
>>>>> sessions once the merge window closes. Based on the MOFED
>>>>> example you gave, it seems like this is solvable.
>>>> 
>>>> Sure, all can be solved. But what is the cost of this and the cost of supporting these changes?
>>>> And the next question: who will pay this cost? Who will provide the HW for the extra testing?
>>>> So the second face of “no cost for kernel API changes” is that there will be problems with backporting these changes and the extra testing.
>>> 
>>> I don't think the backporting will be more burdensome
>>> than porting Lustre to new kernels. And we don't have to
>>> urgently backport each upstream release to older kernels.
>> Neil B. says we need to move all development to mainline. That means
>> kernel upstream would be the same as the ‘master’ branch is now.
>> So each change needs to be backported to older kernels to stay in sync with
>> the server work and be ready for a Lustre release.
>> Otherwise we will have a ton of changes that need to be backported for each
>> Lustre release.
>> I see no difference from porting to upstream, except that this porting
>> from mainline to old kernels must be handled ASAP to avoid delaying a
>> Lustre release, while porting to mainline may be delayed as it is not
>> critical for customers.
> 
> Porting to upstream doesn't work.  The motivation isn't strong enough
> and people leave it then forget it and you get too much divergence and
> it becomes harder so people do it even less.  People have tried.  People
> have failed.
> 
Porting to upstream has worked for the last 20 years, since the product started.
Yes, it is not done for each kernel release, but that isn’t needed.



> Backporting from upstream to an older kernel isn't that hard.  
> I do a
> lot of it and with the right tools it is mostly easy.  One of the
> biggest difficulties is when we try to backport only a selection of
> patches because we might miss an important dependency.  Sometimes it is
> worth it to avoid churn, sometimes it is best to apply everything
> relevant.  I assume that for the selection of kernels that whamcloud (or
> whoever) want to support, they would backport everything that could
> apply.  I think that would be largely mechanical.
> 
> Maybe it would be good for me to paint a more detailed picture of what I
> imagine would happen - assuming we do take the path of landing all of
> lustre, both client and server, upstream.
> 
> - we would change the kernel code in lustre-release so that it was
>  exactly what we plan to submit upstream.
> - we submit it and once accepted we have identical code in upstream
>  linux and lustre-release
> - we fork lustre-release to a new package called (e.g.) lustre-tools and 
>  remove all kernel code leaving just utils and documentation and 
>  test code.  The kinode.c kernel module that is in lustre/tests/kernel/
>  would need to go upstream with the rest of the kernel code I think.
>  lustre-tools would be easily accessible and buildable by anyone who
>  wants to test lustre
> - we fork lustre-release to another new package lustre-backports
>  and remove all non-kernel code from there.  We configure it to build
>  out-of-tree modules with names like "backport-lustre" "backport-lnet"
>  and provide modprobe.conf files that alias the standard names to
>  these (see the sketch just below this list).  That should allow overriding
>  the distributed modules (if any) when people choose to use backports.
> - upstream commits which touch lustre or lnet are automatically added to
>  lustre-backports, and someone is notified to help when they don't apply
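
To make the modprobe.conf idea concrete, here is one way such a file could look. It is only an illustrative sketch; the file name and the backport-lustre / backport-lnet module names are taken from Neil's example above, not from any existing package. An "install" rule makes modprobe run the given command instead of loading a module with the standard name, so the backport build wins even if the distro kernel ships its own lustre/lnet modules:

    # /etc/modprobe.d/lustre-backports.conf -- illustrative sketch only
    # Map the standard module names to the out-of-tree backport builds.
    install lnet /sbin/modprobe backport-lnet
    install lustre /sbin/modprobe backport-lustre

Plain "alias" lines are the literal reading of Neil's description, but an install rule is the unambiguous way to override a module that already exists under the standard name.
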
> 
> With this:
> Anyone who wants to test or use the lustre included with a particular
> kernel can do with with only the lustre-tools package.  Anyone who
> wants to use the latest lustre code with an older kernel can build and
> use lustre-backports.
> 
> There are probably rough-edges with this but I suspect they can be filed
> down.
> 
> thanks,
> NeilBrown
