[lustre-discuss] Lustre 2.11 lnet troubleshooting

Faaland, Olaf P. faaland1 at llnl.gov
Thu Apr 19 08:53:42 PDT 2018


I haven't tested 2.10 yet, but I may get a chance to today.  I created a ticket:

https://jira.hpdd.intel.com/browse/LU-10930

thanks,

Olaf P. Faaland
Livermore Computing

________________________________________
From: Dilger, Andreas <andreas.dilger at intel.com>
Sent: Wednesday, April 18, 2018 8:37:43 PM
To: Faaland, Olaf P.
Cc: lustre-discuss at lists.lustre.org; Shehata, Amir
Subject: Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

On Apr 17, 2018, at 19:00, Faaland, Olaf P. <faaland1 at llnl.gov> wrote:
>
> So the problem was indeed that "routing" was disabled on the router node.  I added "routing: 1" to the lnet.conf file for the routers, and lctl ping now works as expected.
>
> The question about the lnet module option "forwarding" still stands.  The lnet module still accepts a parameter, "forwarding", but it doesn't do what it used to.   Is that just a leftover that needs to be cleaned up?

I would say that the module parameter should continue to work, and be equivalent to the "routing: 1" YAML parameter.  This facilitates upgrades.

Did you try this with 2.10 (which also has LNet Multi-Rail), or are you coming from 2.7 or 2.8?

I'd recommend filing a ticket in Jira for this.  I suspect it might also be broken in 2.10, and the fix should be backported there as well.

Cheers, Andreas

> ________________________________________
> From: Faaland, Olaf P.
> Sent: Tuesday, April 17, 2018 5:05 PM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: Lustre 2.11 lnet troubleshooting
>
> Update:
>
> Joe pointed out "lnetctl set routing 1".  After invoking that on the router node, the compute node reports the route as up:
>
> [root@ulna66:lustre-211]# lnetctl route show -v
> route:
>    - net: o2ib100
>      gateway: 192.168.128.4@o2ib33
>      hop: -1
>      priority: 0
>      state: up
>
> Does this replace the lnet module parameter "forwarding"?
>
> Olaf P. Faaland
> Livermore Computing
>
>
> ________________________________________
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Faaland, Olaf P. <faaland1 at llnl.gov>
> Sent: Tuesday, April 17, 2018 4:34:22 PM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] Lustre 2.11 lnet troubleshooting
>
> Hi,
>
> I've got a cluster running 2.11 with 2 routers and 68 compute nodes.  It's the first time I've used a post-multi-rail version of Lustre.
>
> The problem I'm trying to troubleshoot is that my sample compute node (ulna66) seems to think the router I configured (ulna4) is down, and so an attempt to ping outside the cluster results in failure and "no route to XXX" on the console.  I can lctl ping the router from the compute node and vice-versa.   Forwarding is enabled on the router node via modprobe argument.
>
> lnetctl route show reports that the route is down.  Where I'm stuck is figuring out what in userspace (e.g. lnetctl or lctl) can tell me why.
>
> The compute node's lnet configuration is:
>
> [root@ulna66:lustre-211]# cat /etc/lnet.conf
> ip2nets:
>  - net-spec: o2ib33
>    interfaces:
>         0: hsi0
>    ip-range:
>         0: 192.168.128.*
> route:
>    - net: o2ib100
>      gateway: 192.168.128.4@o2ib33
>
> After I start lnet, systemctl reports success and the state is as follows:
>
> [root@ulna66:lustre-211]# lnetctl net show
> net:
>    - net type: lo
>      local NI(s):
>        - nid: 0@lo
>          status: up
>    - net type: o2ib33
>      local NI(s):
>        - nid: 192.168.128.66@o2ib33
>          status: up
>          interfaces:
>              0: hsi0
>
> [root@ulna66:lustre-211]# lnetctl peer show --verbose
> peer:
>    - primary nid: 192.168.128.4@o2ib33
>      Multi-Rail: False
>      peer ni:
>        - nid: 192.168.128.4@o2ib33
>          state: up
>          max_ni_tx_credits: 8
>          available_tx_credits: 8
>          min_tx_credits: 7
>          tx_q_num_of_buf: 0
>          available_rtr_credits: 8
>          min_rtr_credits: 8
>          refcount: 4
>          statistics:
>              send_count: 2
>              recv_count: 2
>              drop_count: 0
>
> [root@ulna66:lustre-211]# lnetctl route show --verbose
> route:
>    - net: o2ib100
>      gateway: 192.168.128.4@o2ib33
>      hop: -1
>      priority: 0
>      state: down
>
> I can instrument the code, but I figure there must be someplace normal users can look that I'm unaware of.
>
> thanks,
>
> Olaf P. Faaland
> Livermore Computing
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
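
The check described in this thread (a route that `lnetctl route show -v` reports as `state: down` until routing is enabled on the gateway) can be scripted. The following is a minimal sketch, not part of the original discussion: the helper names `parse_routes` and `down_routes` and the `SAMPLE` text are hypothetical, and the parser assumes the output keeps the flat `key: value` layout shown in the thread (with NIDs written as `address@network`; the archive displays `@` as " at "). If `lnetctl` is not installed, the script falls back to the sample text.

```python
import shutil
import subprocess


def parse_routes(text):
    """Parse flat `key: value` output such as `lnetctl route show -v` into dicts.

    Assumption: each route entry starts with "- " and holds only scalar fields,
    as in the output quoted in the thread.
    """
    routes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "route:":
            continue
        if line.startswith("- "):
            # A leading "- " marks the start of a new route entry.
            routes.append({})
            line = line[2:].strip()
        if ":" not in line or not routes:
            continue
        key, _, value = line.partition(":")
        routes[-1][key.strip()] = value.strip()
    return routes


def down_routes(routes):
    """Return the routes whose state is anything other than "up"."""
    return [r for r in routes if r.get("state") != "up"]


# Hypothetical sample, modeled on the compute node output in the thread.
SAMPLE = """\
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: down
"""

if __name__ == "__main__":
    if shutil.which("lnetctl"):
        text = subprocess.run(
            ["lnetctl", "route", "show", "-v"],
            capture_output=True, text=True, check=True,
        ).stdout
    else:
        text = SAMPLE  # no lnetctl on this host; use the sample instead
    for r in down_routes(parse_routes(text)):
        print(f"route to {r.get('net')} via {r.get('gateway')} is {r.get('state')}")
        print("  check on the gateway that routing is enabled (lnetctl set routing 1)")
```

A down route found this way only points at the gateway; in the case discussed here the cause was that routing was disabled on the router node, which the script cannot confirm from the compute node alone.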