[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl
Dilger, Andreas
andreas.dilger at intel.com
Thu Sep 28 20:29:06 PDT 2017
Riccardo,
I'm not an LNet expert, but a number of LNet multi-rail fixes have landed or are being worked on for Lustre 2.10.1. You might try testing the current b2_10 branch to see if that resolves your problems.
Cheers, Andreas
On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>
> Hello.
>
> I configured Multi-Rail on my Lustre environment.
>
> MDS:    172.21.42.213@tcp
> OSS:    172.21.52.118@o2ib
>         172.21.52.86@o2ib
> Client: 172.21.52.124@o2ib
>         172.21.52.125@o2ib
>
>
> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
> nid                    refs state  last   max   rtr   min    tx   min queue
> 172.21.52.124@o2ib        1    NA    -1   128   128   128   128   128     0
> 172.21.52.125@o2ib        1    NA    -1   128   128   128   128   128     0
> 172.21.42.213@tcp         1    NA    -1     8     8     8     8     6     0
>
> After configuring Multi-Rail, I can see both InfiniBand interface peers on the OSS and on the client side.
> Before Multi-Rail was configured, the Lustre client could mount the Lustre file system without problems.
> Now that Multi-Rail is set up, the client can no longer mount the file system.
>
> When I mount Lustre from the client (fstab entry):
>
> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
>
> the file system cannot be mounted and I get these errors:
>
> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126] req@ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>
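[Editorial sketch, not part of the original exchange.] When comparing several of these console messages, it can help to pull out the target, the NID, and the send/deadline timestamps mechanically. The following is a minimal Python sketch that assumes the message has been rejoined onto one line and that NIDs are written in the usual `address@net` form; the pattern and the helper name `parse_expired_request` are hypothetical and were written only against the single line quoted above, so other Lustre versions may format the message differently.

```python
import re

# Sample line: the ptlrpc_expire_one_request() message quoted above,
# rejoined onto one line (syslog prefix omitted).
LINE = (
    "Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has "
    "failed due to network error: [sent 1506562126/real 1506562126] "
    "req@ffff8808326b2a00 x1579744801849904/t0(0) "
    "o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 "
    "lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1"
)

# Hypothetical pattern, fitted to the sample line only.
PATTERN = re.compile(
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*?"
    r"o(?P<opc>\d+)->(?P<import_name>[^@\s]+)@(?P<nid>[^:\s]+):"
    r".*?\bdl (?P<deadline>\d+)"
)


def parse_expired_request(line):
    """Return the fields of interest from one expired-request line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    sent = int(m.group("sent"))
    deadline = int(m.group("deadline"))
    return {
        "opcode": int(m.group("opc")),
        "import": m.group("import_name"),
        "nid": m.group("nid"),
        "sent": sent,
        "deadline": deadline,
        # Seconds between the send time and the request deadline.
        "window_s": deadline - sent,
    }


info = parse_expired_request(LINE)
print(info)
```

For the quoted line this reports the request going to 172.21.52.86@o2ib with a 7-second window between send time and deadline. Opcode 400 is, as far as I know, the OBD_PING request, which would fit the repeated "lost"/"restored" messages.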
>
> The mount point appears in and disappears from the "df" output every few seconds.
>
> I do not have a clue how to fix this. The Multi-Rail capability is important to me.
>
> I have Lustre 2.10.0 on both the client side and the server side.
> Here is my lnet.conf on the Lustre client side. The one on the OSS side is
> similar, with just the peers swapped for the o2ib net.
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.124@o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 7
>               recv_count: 7
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
>         - nid: 172.21.52.125@o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 5
>               recv_count: 5
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.195@tcp
>           status: up
>           interfaces:
>               0: enp7s0f0
>           statistics:
>               send_count: 51
>               recv_count: 51
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: -1
>           CPT: "[0]"
> peer:
>     - primary nid: 172.21.42.213@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.213@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 6
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.86@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.86@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>         - nid: 172.21.52.118@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>
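[Editorial sketch, not part of the original exchange.] To compare the peer definitions on the client and the OSS quickly, one could reduce each configuration dump to "primary NID, Multi-Rail flag, peer NIDs". Below is a minimal Python sketch that does this with plain string handling, so it tolerates lost indentation and leading `>` quote markers. The helper name `summarize_peers` and the trimmed `SAMPLE` text are hypothetical; the sample keeps only the peer lines from the configuration above, with NIDs written as `address@net`.

```python
def summarize_peers(text):
    """Map each primary NID in the 'peer:' section to its Multi-Rail flag and NIDs."""
    peers = {}
    current = None
    in_peer_section = False
    for raw in text.splitlines():
        # Drop indentation and any leading email quote marker.
        line = raw.strip().lstrip(">").strip()
        if line == "peer:":
            in_peer_section = True
            continue
        if not in_peer_section:
            continue  # skip the 'net:' section, which also has '- nid:' lines
        if line.startswith("- primary nid:"):
            current = line.split(":", 1)[1].strip()
            peers[current] = {"multi_rail": False, "nids": []}
        elif line.startswith("Multi-Rail:") and current is not None:
            peers[current]["multi_rail"] = line.split(":", 1)[1].strip() == "True"
        elif line.startswith("- nid:") and current is not None:
            peers[current]["nids"].append(line.split(":", 1)[1].strip())
    return peers


# Trimmed copy of the peer section shown above (credit counters omitted).
SAMPLE = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.21.42.213@tcp
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
        - nid: 172.21.52.118@o2ib
"""

peers = summarize_peers(SAMPLE)
for primary, info in peers.items():
    print(primary, "multi-rail" if info["multi_rail"] else "single", info["nids"])
```

Running the same summary over the OSS-side file would make it easier to see whether both sides list the same NIDs under each primary NID. This only reads the text; it does not query LNet itself.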
> Thank you very much for any hints you may give.
>
> Rick
>
>
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation