[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Wed Sep 27 20:22:13 PDT 2017


Hello.

I have configured Multi-Rail on my Lustre environment.

MDS: 172.21.42.213@tcp
OSS: 172.21.52.118@o2ib
        172.21.52.86@o2ib
Client: 172.21.52.124@o2ib
        172.21.52.125@o2ib
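
The Multi-Rail peers were added with lnetctl, roughly like this (a sketch reconstructed from the peer configuration further below; the exact invocations may have differed slightly):

# on the client: group the two OSS NIDs under one Multi-Rail peer
lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
# on the OSS: group the two client NIDs the same way
lnetctl peer add --prim_nid 172.21.52.124@o2ib --nid 172.21.52.125@o2ib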

 
[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128     0
172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128     0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6     0
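
The same peers can also be listed through lnetctl (just for reference; output omitted here):

lnetctl peer show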

After configuring Multi-Rail I can see both InfiniBand interfaces as peers on the OSS and on the client side.
Before Multi-Rail the client could mount the Lustre FS without problems; now that Multi-Rail is set up, the client can no longer mount the filesystem.

When I mount Lustre from the client (fstab entry):

172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
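
which, as far as the options go, should be equivalent to mounting by hand with something like (a sketch of the same mount):

mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu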

the filesystem cannot be mounted and I get these errors:

Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126] req@ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86@o2ib (at 172.21.52.86@o2ib)


The mount point appears in "df" and disappears again every few seconds.

I do not have a clue how to fix this. The Multi-Rail capability is important for me.
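
For what it is worth, basic LNet connectivity to each NID can be checked from the client with lctl ping, e.g. (just a sketch of the checks, output omitted):

lctl ping 172.21.52.86@o2ib
lctl ping 172.21.52.118@o2ib
lctl ping 172.21.42.213@tcp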

I have Lustre 2.10.0 on both the client side and the server side.
Here is my lnet.conf on the Lustre client side. The one on the OSS side is
similar, just with the peers swapped for the o2ib net.

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 7
              recv_count: 7
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 5
              recv_count: 5
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 51
              recv_count: 51
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
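
For completeness, this file gets applied with lnetctl import, roughly like this (a sketch; /etc/lnet.conf is just where I keep the file, and the module load may already be handled by the lnet service):

modprobe lnet
lnetctl lnet configure
lnetctl import < /etc/lnet.conf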

Thank you very much for any hint you may give.

Rick

