[lustre-discuss] multi-hop routing

Horn, Chris chris.horn at hpe.com
Thu Mar 6 08:23:44 PST 2025


Sounds like a bug. What Lustre version is being used?

Chris Horn

From: John White <jwhite at lbl.gov>
Date: Wednesday, March 5, 2025 at 5:48 PM
To: Horn, Chris <chris.horn at hpe.com>
Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] multi-hop routing
Just a quick follow-up for posterity: I did need to add a route for tcp on the server side.  lctl ping was working, but MGS communication was failing, saying it couldn’t talk back to the router:
[Wed Mar  5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:2078:lnet_handle_find_routed_path()) no route to 10.38.0.250@tcp from 10.5.250.22@o2ib
[Wed Mar  5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:3991:lnet_parse_get()) 10.5.250.22@o2ib: Unable to send REPLY for GET from 12345-10.38.0.250@tcp: -113

Adding a route to tcp via its geo-local router fixed that, and we’ve got mounts passing IO.  I didn’t seem to need to do the same for clients at all.
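
For anyone landing here later, a rough sketch of what that server-side change might look like with lnetctl. The gateway NID is a placeholder (the thread never gives the o2ib NID of the server-side router), the helper name server_routes is made up for illustration, and the hop counts simply mirror the show_route layout quoted further down. The helper only prints the commands so they can be reviewed before being run:

```shell
#!/bin/sh
# Print (not execute) the route commands for a server on the o2ib side.
# $1 = o2ib NID of the local LNet router. Hypothetical helper name.
server_routes() {
    gw="$1"
    # Far client fabric: two hops away, through both routers.
    printf 'lnetctl route add --net o2ib2 --gateway %s --hop 2\n' "$gw"
    # Intermediate tcp net: one hop away. Without this, replies addressed
    # to a router's tcp NID have no path back ("no route to ...@tcp").
    printf 'lnetctl route add --net tcp --gateway %s --hop 1\n' "$gw"
}

# 10.5.0.250@o2ib is a placeholder NID, not taken from the thread.
server_routes "10.5.0.250@o2ib"
```

Once the printed lines look right, they can be piped to sh on each server and checked with `lnetctl route show`. Routes added this way are runtime-only, so they still need to go into whatever persistent LNet config the site already uses.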

> On Mar 5, 2025, at 2:29 PM, John White <jwhite at lbl.gov> wrote:
>
> Oh, so don’t even tell the client about tcp!  That seems to have immediately kicked things into place!
> I owe you a beverage of your choice if we ever meet up!
>
> Seriously, the imposter syndrome was getting _bad_ the last few days here.
>
>> On Mar 5, 2025, at 12:05 PM, Horn, Chris <chris.horn at hpe.com> wrote:
>>
>> You need LNet routes configured on all nodes. It should look something like this:
>>
>> # pdsh -w n0[0-3] 'lctl list_nids; lctl show_route' | dshbak -c
>> ----------------
>> server
>> ----------------
>> 172.18.2.5@o2ib
>> net              o2ib2 hops 2 gw                  172.18.2.6@o2ib up pri 0
>> ----------------
>> router1
>> ----------------
>> 172.18.2.6@o2ib
>> 172.18.2.2@tcp
>> net              o2ib2 hops 1 gw                   172.18.2.3@tcp up pri 0
>> ----------------
>> router2
>> ----------------
>> 172.18.2.7@o2ib2
>> 172.18.2.3@tcp
>> net               o2ib hops 1 gw                   172.18.2.2@tcp up pri 0
>> ----------------
>> client
>> ----------------
>> 172.18.2.8@o2ib2
>> net               o2ib hops 2 gw                 172.18.2.7@o2ib2 up pri 0
>> #
>> Chris Horn
>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of John White via lustre-discuss <lustre-discuss at lists.lustre.org>
>> Date: Wednesday, March 5, 2025 at 1:17 PM
>> To: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
>> Subject: [lustre-discuss] multi-hop routing
>> Hello folks.  I have a rare situation that I’m told some centers are successfully pulling off and am looking for guidance - multi-hop lnet routing.
>> In short, I have 2 distinct o2ib fabrics at disparate geo sites joined by a routed ethernet fabric.  I’m looking to use a 2-lnet-router chain to plumb the two o2ib fabrics together.
>>
>> servers on the left, clients on the right
>> o2ib0(10.5.0.0/16) <-> router(o2ib0,tcp0) <-> routed eth (10.37.0.0/16, 10.38.0.0/16) <-> router(tcp0,o2ib2) <-> o2ib2(10.6.0.0/16)
>>
>> I have both sets of routers up but traffic absolutely fails the 2nd hop in either direction (I can `lctl ping` tcp0 from o2ib2 and o2ib0 but no further).
>>
>> I’ve tried adding a route ON the routers, but that didn’t help.
>>
>> I’ve tried defining the 2nd hop on the client:
>> options lnet routes="tcp0 10.6.0.[250-251]@o2ib2;\
>> o2ib0 10.37.250.[162-163]@tcp0"
>>
>> but that failed with the following kern message on lnet load:
>> 74067:0:(router.c:644:lnet_add_route()) Cannot add route with gateway 10.37.250.162@tcp. There is no local interface configured on LNet tcp
>>
>> Does anyone have any hints here?  It feels like I’m a syntax change or a routing hint away from getting this working.
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>