[lustre-discuss] Lustre 2.11 lnet troubleshooting

Faaland, Olaf P. faaland1 at llnl.gov
Tue Apr 17 18:00:10 PDT 2018


So the problem was indeed that "routing" was disabled on the router node.  I added "routing: 1" to the lnet.conf file for the routers, and lctl ping works as expected.
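For the archives, the relevant fragment of the routers' /etc/lnet.conf now looks roughly like this (just the line I added; the rest of the file, with the nets and interfaces, is unchanged and setup-specific):

```yaml
# Fragment of /etc/lnet.conf on the router nodes.
# The net/interface definitions for o2ib33 and o2ib100 are as before;
# this is the line that enabled routing.
routing: 1
```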

The question about the lnet module option "forwarding" still stands.  The module still accepts a "forwarding" parameter, but it no longer does what it used to.  Is that just a leftover that needs to be cleaned up?

thanks,

Olaf P. Faaland
Livermore Computing

________________________________________
From: Faaland, Olaf P.
Sent: Tuesday, April 17, 2018 5:05 PM
To: lustre-discuss at lists.lustre.org
Subject: Re: Lustre 2.11 lnet troubleshooting

Update:

Joe pointed out "lnetctl set routing 1".  After invoking that on the router node, the compute node reports the route as up:

[root@ulna66:lustre-211]# lnetctl route show -v
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: up

Does this replace the lnet module parameter "forwarding"?

Olaf P. Faaland
Livermore Computing


________________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Faaland, Olaf P. <faaland1 at llnl.gov>
Sent: Tuesday, April 17, 2018 4:34:22 PM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] Lustre 2.11 lnet troubleshooting

Hi,

I've got a cluster running 2.11 with 2 routers and 68 compute nodes.  It's the first time I've used a post-multi-rail version of Lustre.

The problem I'm trying to troubleshoot is that my sample compute node (ulna66) seems to think the router I configured (ulna4) is down, so attempts to ping outside the cluster fail with "no route to XXX" on the console.  I can lctl ping the router from the compute node and vice versa.  Forwarding is enabled on the router node via a modprobe argument.
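For reference, forwarding is set on the router roughly like this (the second interface name below is from our setup and is an assumption for anyone reading along; adjust to yours):

```conf
# /etc/modprobe.d/lnet.conf on ulna4 (the router).
# hsi0 faces the compute nodes (o2ib33); ib0 is our o2ib100-side
# interface and is setup-specific.
options lnet networks="o2ib33(hsi0),o2ib100(ib0)" forwarding="enabled"
```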

lnetctl route show reports that the route is down.  Where I'm stuck is figuring out what in userspace (e.g. lnetctl or lctl) can tell me why.
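If there is an lnetctl subcommand that reports the router's own routing state, that would be a start; I believe it is something like the following, but I am going from memory, so treat it as a sketch:

```shell
# On the router node (ulna4): ask LNet whether it considers itself
# a router (i.e. whether routing/forwarding is enabled there).
# Syntax from memory for lnetctl in 2.11; treat as a sketch.
lnetctl routing show
```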

The compute node's lnet configuration is:

[root@ulna66:lustre-211]# cat /etc/lnet.conf
ip2nets:
  - net-spec: o2ib33
    interfaces:
         0: hsi0
    ip-range:
         0: 192.168.128.*
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33

After I start lnet, systemctl reports success and the state is as follows:

[root@ulna66:lustre-211]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib33
      local NI(s):
        - nid: 192.168.128.66@o2ib33
          status: up
          interfaces:
              0: hsi0

[root@ulna66:lustre-211]# lnetctl peer show --verbose
peer:
    - primary nid: 192.168.128.4@o2ib33
      Multi-Rail: False
      peer ni:
        - nid: 192.168.128.4@o2ib33
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 7
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 4
          statistics:
              send_count: 2
              recv_count: 2
              drop_count: 0

[root@ulna66:lustre-211]# lnetctl route show --verbose
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: down

I can instrument the code, but I figure there must be somewhere normal users can look that I'm unaware of.

thanks,

Olaf P. Faaland
Livermore Computing
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
