[lustre-discuss] Lustre 2.11 lnet troubleshooting
Faaland, Olaf P.
faaland1 at llnl.gov
Thu Apr 19 08:53:42 PDT 2018
I haven't tested 2.10 yet, but I may get a chance to today. I created ticket
https://jira.hpdd.intel.com/browse/LU-10930
thanks,
Olaf P. Faaland
Livermore Computing
________________________________________
From: Dilger, Andreas <andreas.dilger at intel.com>
Sent: Wednesday, April 18, 2018 8:37:43 PM
To: Faaland, Olaf P.
Cc: lustre-discuss at lists.lustre.org; Shehata, Amir
Subject: Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting
On Apr 17, 2018, at 19:00, Faaland, Olaf P. <faaland1 at llnl.gov> wrote:
>
> So the problem was indeed that "routing" was disabled on the router node. I added "routing: 1" to the lnet.conf file for the routers and lctl ping works as expected.
>
> The question about the lnet module option "forwarding" still stands. The lnet module still accepts a parameter, "forwarding", but it doesn't do what it used to. Is that just a leftover that needs to be cleaned up?
I would say that the module parameter should continue to work, and be equivalent to the "routing: 1" YAML parameter. This facilitates upgrades.
Did you try this with 2.10 (which also has LNet Multi-Rail), or are you coming from 2.7 or 2.8?
I'd recommend filing a ticket in Jira for this. I suspect it might also be broken in 2.10, and the fix should be backported there as well.
Cheers, Andreas
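
The check described in this thread (enable routing on the router, then confirm on the client that the route state changes from "down" to "up") can be scripted. Below is a minimal sketch in Python, assuming the flat "key: value" layout that `lnetctl route show -v` prints in the quoted output; the helper names `parse_routes` and `down_routes` are hypothetical, and the sample text is copied from the thread rather than read from a live node. It only flags routes that are not up; it does not explain why, which in this case turned out to be routing being disabled on the gateway.

```python
def parse_routes(text):
    """Parse the 'key: value' layout printed by `lnetctl route show -v`.

    Assumption: each route entry starts with a '- ' list marker and every
    following 'key: value' line belongs to that entry.
    """
    routes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "route:":
            continue  # skip blank lines and the top-level header
        if line.startswith("- "):
            routes.append({})          # a new route entry begins here
            line = line[2:].strip()
        if ":" not in line or not routes:
            continue
        key, _, value = line.partition(":")
        routes[-1][key.strip()] = value.strip()
    return routes


def down_routes(routes):
    """Return the entries whose state is anything other than 'up'."""
    return [r for r in routes if r.get("state") != "up"]


# Sample copied from the compute node output quoted in this thread.
SAMPLE = """\
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: down
"""

if __name__ == "__main__":
    for r in down_routes(parse_routes(SAMPLE)):
        print(f"route to {r.get('net')} via {r.get('gateway')} is {r.get('state')}")
```

On a real client, the text would come from running `lnetctl route show -v` instead of the embedded sample; a route reported here is a hint to check whether routing is enabled on the gateway it names.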
> ________________________________________
> From: Faaland, Olaf P.
> Sent: Tuesday, April 17, 2018 5:05 PM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: Lustre 2.11 lnet troubleshooting
>
> Update:
>
> Joe pointed out "lnetctl set routing 1". After invoking that on the router node, the compute node reports the route as up:
>
> [root@ulna66:lustre-211]# lnetctl route show -v
> route:
>     - net: o2ib100
>       gateway: 192.168.128.4@o2ib33
>       hop: -1
>       priority: 0
>       state: up
>
> Does this replace the lnet module parameter "forwarding"?
>
> Olaf P. Faaland
> Livermore Computing
>
>
> ________________________________________
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Faaland, Olaf P. <faaland1 at llnl.gov>
> Sent: Tuesday, April 17, 2018 4:34:22 PM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] Lustre 2.11 lnet troubleshooting
>
> Hi,
>
> I've got a cluster running 2.11 with 2 routers and 68 compute nodes. It's the first time I've used a post-multi-rail version of Lustre.
>
> The problem I'm trying to troubleshoot is that my sample compute node (ulna66) seems to think the router I configured (ulna4) is down, and so an attempt to ping outside the cluster results in failure and "no route to XXX" on the console. I can lctl ping the router from the compute node and vice-versa. Forwarding is enabled on the router node via modprobe argument.
>
> lnetctl route show reports that the route is down. Where I'm stuck is figuring out what in userspace (e.g. lnetctl or lctl) can tell me why.
>
> The compute node's lnet configuration is:
>
> [root@ulna66:lustre-211]# cat /etc/lnet.conf
> ip2nets:
>     - net-spec: o2ib33
>       interfaces:
>           0: hsi0
>       ip-range:
>           0: 192.168.128.*
> route:
>     - net: o2ib100
>       gateway: 192.168.128.4@o2ib33
>
> After I start lnet, systemctl reports success and the state is as follows:
>
> [root@ulna66:lustre-211]# lnetctl net show
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: o2ib33
>       local NI(s):
>         - nid: 192.168.128.66@o2ib33
>           status: up
>           interfaces:
>               0: hsi0
>
> [root@ulna66:lustre-211]# lnetctl peer show --verbose
> peer:
>     - primary nid: 192.168.128.4@o2ib33
>       Multi-Rail: False
>       peer ni:
>         - nid: 192.168.128.4@o2ib33
>           state: up
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           refcount: 4
>           statistics:
>               send_count: 2
>               recv_count: 2
>               drop_count: 0
>
> [root@ulna66:lustre-211]# lnetctl route show --verbose
> route:
>     - net: o2ib100
>       gateway: 192.168.128.4@o2ib33
>       hop: -1
>       priority: 0
>       state: down
>
> I can instrument the code, but I figure there must be someplace available to normal users to look, that I'm unaware of.
>
> thanks,
>
> Olaf P. Faaland
> Livermore Computing
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation