[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl

Dilger, Andreas andreas.dilger at intel.com
Thu Sep 28 20:29:06 PDT 2017


Riccardo,
I'm not an LNet expert, but a number of LNet multi-rail fixes have landed or are being worked on for Lustre 2.10.1.  You might try testing the current b2_10 branch to see if that resolves your problems.

Cheers, Andreas

On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
> 
> Hello.
> 
> I configured Multi-Rail in my Lustre environment.
> 
> MDS: 172.21.42.213 at tcp
> OSS: 172.21.52.118 at o2ib
>         172.21.52.86 at o2ib
> Client: 172.21.52.124 at o2ib
>         172.21.52.125 at o2ib
> 
>  
> [root at drp-tst-oss10:~]# cat /proc/sys/lnet/peers
> nid                      refs state  last   max   rtr   min    tx   min queue
> 172.21.52.124 at o2ib          1    NA    -1   128   128   128   128   128 0
> 172.21.52.125 at o2ib          1    NA    -1   128   128   128   128   128 0
> 172.21.42.213 at tcp           1    NA    -1     8     8     8     8     6 0
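
The peers table above can be read programmatically. The following is a minimal sketch, not an official tool: it assumes the ten-column layout shown above, writes the NIDs with "@" as they appear on a live system, and assumes that the second `min` column is the low-water mark of transmit credits. The names `Peer`, `parse_peers`, and `peers_that_used_credits` are hypothetical helpers introduced for illustration.

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    """One row of the peers table (field names are illustrative)."""
    nid: str
    refs: int
    state: str
    last: int
    max_credits: int
    rtr: int
    min_rtr: int
    tx: int
    min_tx: int
    queue: int


def parse_peers(text: str) -> List[Peer]:
    """Parse the whitespace-separated peers table, skipping the header row."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[0] == "nid":
            continue  # header, blank, or unexpected line
        nid, refs, state, *counters = fields
        peers.append(Peer(nid, int(refs), state, *(int(c) for c in counters)))
    return peers


def peers_that_used_credits(peers: List[Peer]) -> List[str]:
    """NIDs whose tx-credit low-water mark dropped below the maximum
    (assumption: this means at least one message was queued to that peer)."""
    return [p.nid for p in peers if p.min_tx < p.max_credits]


PEERS_SAMPLE = """\
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0
"""

if __name__ == "__main__":
    print(peers_that_used_credits(parse_peers(PEERS_SAMPLE)))
```

Under that reading of the columns, only the tcp peer has consumed transmit credits so far, while both o2ib peers still sit at their maximum.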
> 
> After configuring multi-rail I can see both InfiniBand interface peers on the OSS and on the client side.
> Before multi-rail, the Lustre client could mount the Lustre FS without problems.
> Now that multi-rail is set up, the client can no longer mount the filesystem.
> 
> When I mount lustre from the client (fstab entry):
> 
> 172.21.42.213 at tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
> 
> the file system cannot be mounted and I get these errors:
> 
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126] req at ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800 at 172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86 at o2ib (at 172.21.52.86 at o2ib)
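
The first log record above packs several fields into one line. The sketch below pulls out the ones that matter for this problem: the RPC opcode, the target, the NID the request was sent to, the send time, and the deadline. It is an illustration only: the regular expression is fitted to this one record (with "@" written as in the kernel log), `decode_request` is a hypothetical helper, and the `dl` field is assumed to be the request deadline in epoch seconds. Opcode 400 is OBD_PING in Lustre.

```python
import re
from datetime import datetime, timezone

# Only the opcode that appears in the record above; 400 is OBD_PING.
OPCODES = {400: "OBD_PING"}

# Fitted to the record format shown above, not a general Lustre log parser.
REQUEST_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real \d+\]"
    r".*?o(?P<opc>\d+)->\s*(?P<target>\S+?)@(?P<nid>[\d.]+@\w+):"
    r".*?\bdl (?P<deadline>\d+)",
    re.DOTALL,
)


def decode_request(record: str) -> dict:
    """Extract opcode, target, NID, send time, and deadline from one record."""
    m = REQUEST_RE.search(record)
    if m is None:
        raise ValueError("not a ptlrpc request record")
    sent = int(m["sent"])
    opc = int(m["opc"])
    return {
        "opcode": OPCODES.get(opc, f"opcode {opc}"),
        "target": m["target"],
        "nid": m["nid"],
        "sent_utc": datetime.fromtimestamp(sent, tz=timezone.utc).isoformat(),
        # Assumption: "dl" is the deadline as an epoch timestamp.
        "deadline_after_s": int(m["deadline"]) - sent,
    }


LOG_SAMPLE = (
    "Request sent has failed due to network error: "
    "[sent 1506562126/real 1506562126] req@ffff8808326b2a00 "
    "x1579744801849904/t0(0) "
    "o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 "
    "lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1"
)

if __name__ == "__main__":
    for key, value in decode_request(LOG_SAMPLE).items():
        print(f"{key}: {value}")
```

Read this way, the failing request is a ping to OST0001 sent over 172.21.52.86@o2ib, which points at the o2ib path to the OSS rather than at the tcp connection to the MDS.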
> 
> 
> The mount point appears and disappears every few seconds in the "df" output.
> 
> I do not have a clue how to fix this. The multi-rail capability is important for me.
> 
> I have Lustre 2.10.0 on both the client side and the server side.
> Here is my lnet.conf on the Lustre client side. The one on the OSS side is
> similar, just with the peers swapped for the o2ib net.
> 
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0 at lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.124 at o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 7
>               recv_count: 7
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
>         - nid: 172.21.52.125 at o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 5
>               recv_count: 5
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.195 at tcp
>           status: up
>           interfaces:
>               0: enp7s0f0
>           statistics:
>               send_count: 51
>               recv_count: 51
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
> peer:
>     - primary nid: 172.21.42.213 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.213 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 6
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.86 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.86 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>         - nid: 172.21.52.118 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
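
For reference, a multi-rail peer entry like the o2ib one above corresponds to a static peer definition. The sketch below only builds the argument list for such a definition and does not run anything. It assumes the `lnetctl peer add --prim_nid ... --nid ...` form described in the Lustre manual's multi-rail section, which should be checked against the installed version; `peer_add_command` is a hypothetical helper.

```python
from typing import List, Sequence


def peer_add_command(prim_nid: str, other_nids: Sequence[str]) -> List[str]:
    """Build (but do not execute) an lnetctl command that declares a
    multi-rail peer with one primary NID and its additional NIDs.
    Assumption: the installed lnetctl accepts --prim_nid and --nid."""
    if not other_nids:
        raise ValueError("a multi-rail peer needs at least one additional NID")
    return [
        "lnetctl", "peer", "add",
        "--prim_nid", prim_nid,
        "--nid", ",".join(other_nids),
    ]


if __name__ == "__main__":
    # NIDs taken from the peer section above, written with "@".
    cmd = peer_add_command("172.21.52.86@o2ib", ["172.21.52.118@o2ib"])
    print(" ".join(cmd))
```

The OSS side would need the mirror image of this, with the client's two o2ib NIDs declared as one peer.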
> 
> thank you very much for any hint you may give.
> 
> Rick
> 

--
Andreas Dilger
Lustre Principal Architect
Intel Corporation