[lustre-discuss] Lustre 2.10.1 error while mounting multi-rail

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Mon Oct 9 16:44:00 PDT 2017


Hello.

Here I am again, trying to get Multi-Rail to work.

I configured Multi-Rail on both the OSS and the client side.

I have one OSS, one MDS and one client, all running RHEL 7.4 and Lustre 2.10.1:

  * psdrp-tst-mds10 MDS
  * drp-tst-oss10 OSS (172.21.52.86@o2ib, 172.21.52.118@o2ib)
  * drp-tst-lu10 Lustre client (172.21.52.124@o2ib, 172.21.52.125@o2ib)

Without Multi-Rail everything works fine.

What I am doing is aggregating two IB interfaces to get more
performance. However, when I mount the Lustre file system from the
Lustre client, I get this error and the file system does not mount:

Oct  9 16:23:50 drp-tst-lu10 kernel: [248177.914832] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib
rejected: consumer defined fatal error
Oct  9 16:23:50 drp-tst-lu10 kernel: [248177.917290] Lustre: Mounted
drplu-client
Oct  9 16:23:50 drp-tst-lu10 kernel: [248177.920832] Lustre:
31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1507591430/real 1507591430] 
req@ffff8807f56a0300 x1580812428378832/t0(0)
o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens
520/544 e 0 to 1 dl 1507591435 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct  9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError:
673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct  9 16:23:57 drp-tst-lu10 kernel: [248184.645463] LustreError:
674:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct  9 16:23:58 drp-tst-lu10 kernel: [248186.117364] LustreError:
678:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct  9 16:23:58 drp-tst-lu10 kernel: [248186.117411] LustreError:
678:0:(llite_lib.c:1748:ll_statfs_internal()) Skipped 1 previous similar
message
Oct  9 16:24:15 drp-tst-lu10 kernel: [248202.912554] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib
rejected: consumer defined fatal error
Oct  9 16:24:15 drp-tst-lu10 kernel: [248202.912610] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) Skipped 3 previous similar
messages
Oct  9 16:24:15 drp-tst-lu10 kernel: [248202.918903] Lustre:
31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1507591455/real 1507591455] 
req@ffff88075d2ee700 x1580812428378960/t0(0)
o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens
520/544 e 0 to 1 dl 1507591465 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct  9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError:
673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
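
To make the rejection pattern in the log easier to see, the messages can be counted per NID. This is only a minimal Python sketch: it assumes the kernel messages are available as unwrapped single lines with NIDs in the usual `addr@net` form, and `rejected_nids` is a hypothetical helper name, not something shipped with Lustre.

```python
import re
from collections import Counter

# Matches o2iblnd rejection messages such as:
#   ...kiblnd_rejected()) 172.21.52.118@o2ib rejected: consumer defined fatal error
REJECT_RE = re.compile(r"kiblnd_rejected\(\)\)\s+(\S+@\S+)\s+rejected:")


def rejected_nids(log_text):
    """Count kiblnd_rejected() messages per peer NID."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REJECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines taken from the log above, joined back into single lines.
SAMPLE_LOG = "\n".join([
    "Oct  9 16:23:50 drp-tst-lu10 kernel: [248177.914832] LNetError: "
    "1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib "
    "rejected: consumer defined fatal error",
    "Oct  9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError: "
    "673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5",
    "Oct  9 16:24:15 drp-tst-lu10 kernel: [248202.912554] LNetError: "
    "1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib "
    "rejected: consumer defined fatal error",
    "Oct  9 16:24:15 drp-tst-lu10 kernel: [248202.912610] LNetError: "
    "1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) Skipped 3 previous "
    "similar messages",
])

if __name__ == "__main__":
    for nid, count in rejected_nids(SAMPLE_LOG).items():
        print(nid, count)
```

In this log the only NID that appears in a rejection message is 172.21.52.118@o2ib, the second interface of the OSS. The "Skipped 3 previous similar messages" line carries no NID, so the sketch does not count it.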

fstab entry: 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0

I can see the peers in the LNet status on both nodes:

[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0



[root@drp-tst-lu10:etc]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.118@o2ib          1    NA    -1   128   128   128   128   127 0
172.21.52.86@o2ib           1    NA    -1   128   128   128   128   102 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0
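
The peers table has two columns named `min`, which makes it easy to misread. Here is a small Python sketch that gives every column its own name. It assumes the header and each row sit on a single line with NIDs in `addr@net` form, and that the second `min` is the low-water mark of the tx credits, which is how I read the output. The field names `rtr_min` and `tx_min` and the function `parse_peers` are my own labels, not LNet names.

```python
# Column names for /proc/sys/lnet/peers; the two "min" columns are renamed
# so that they can be told apart (assumed meaning: low-water marks).
FIELDS = ("nid", "refs", "state", "last", "max",
          "rtr", "rtr_min", "tx", "tx_min", "queue")


def parse_peers(text):
    """Parse the peers table into a list of dicts, skipping the header."""
    peers = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or parts[0] == "nid":
            continue  # header, blank line, or anything unexpected
        row = dict(zip(FIELDS, parts))
        for key in FIELDS:
            if key not in ("nid", "state"):
                row[key] = int(row[key])
        peers.append(row)
    return peers


# Client-side table from above.
SAMPLE_PEERS = """\
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.118@o2ib          1    NA    -1   128   128   128   128   127 0
172.21.52.86@o2ib           1    NA    -1   128   128   128   128   102 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0
"""

if __name__ == "__main__":
    for peer in parse_peers(SAMPLE_PEERS):
        # Most tx credits ever in use at once towards this peer NID.
        print(peer["nid"], peer["max"] - peer["tx_min"])
```

For the client table this gives 1 for 172.21.52.118@o2ib, 26 for 172.21.52.86@o2ib and 2 for 172.21.42.213@tcp, so under the assumption above almost all o2ib traffic has gone to the first OSS interface.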


Here is my LNet configuration with Multi-Rail on the OSS side:


[root@drp-tst-oss10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.86@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
        - nid: 172.21.52.118@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.211@tcp
          status: up
          interfaces:
              0: enp1s0f0
          statistics:
              send_count: 198
              recv_count: 198
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 198
          recv_count: 198
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.124@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.124@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.125@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0



Here is the LNet configuration on the client side:


[root@drp-tst-lu10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 403742
              recv_count: 807391
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 99
              recv_count: 99
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 99
          recv_count: 99
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 102
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 403742
          recv_count: 807391
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0
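
One way to cross-check two exports like these is to compare the local o2ib NIDs of one node with the peer NIDs known to the other. The Python sketch below does that with plain text matching on trimmed copies of the two exports, with NIDs written in `addr@net` form. The function names `nids_by_section` and `unknown_remote_nids` are hypothetical, and the parsing assumes the `lnetctl export` layout shown above rather than using a real YAML parser.

```python
import re


def nids_by_section(export_text):
    """Collect '- nid:' entries found under the top-level net: and peer: keys."""
    found = {"net": set(), "peer": set()}
    section = None
    for line in export_text.splitlines():
        if line.startswith("net:"):
            section = "net"
        elif line.startswith("peer:"):
            section = "peer"
        elif line and not line[0].isspace():
            section = None  # another top-level key, e.g. "numa:"
        else:
            # "- primary nid:" lines do not match; only "- nid:" entries do.
            match = re.match(r"\s*- nid: (\S+)", line)
            if match and section:
                found[section].add(match.group(1))
    return found


def unknown_remote_nids(local_export, remote_export, net="o2ib"):
    """Local NIDs of the remote node on `net` that this node has no peer entry for."""
    known = nids_by_section(local_export)["peer"]
    remote_local = nids_by_section(remote_export)["net"]
    return {nid for nid in remote_local if nid.endswith("@" + net)} - known


# Trimmed copies of the two exports above (o2ib entries only).
OSS_EXPORT = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.86@o2ib
        - nid: 172.21.52.118@o2ib
peer:
    - primary nid: 172.21.52.124@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.124@o2ib
        - nid: 172.21.52.125@o2ib
numa:
    range: 0
"""

CLIENT_EXPORT = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
        - nid: 172.21.52.125@o2ib
peer:
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
        - nid: 172.21.52.118@o2ib
numa:
    range: 0
"""

if __name__ == "__main__":
    print(unknown_remote_nids(OSS_EXPORT, CLIENT_EXPORT))
    print(unknown_remote_nids(CLIENT_EXPORT, OSS_EXPORT))
```

For the OSS and client exports above, both directions return an empty set: each side has a peer entry for both o2ib NIDs of the other. The MDS is not covered by this check, since I have not included its export here.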


Even so, Lustre does not work. This is really weird; it should.

Any hints?

Thank you.


Rick

