[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl
Riccardo Veraldi
Riccardo.Veraldi at cnaf.infn.it
Fri Sep 29 06:17:07 PDT 2017
On 9/28/17 8:29 PM, Dilger, Andreas wrote:
> Riccardo,
> I'm not an LNet expert, but a number of LNet multi-rail fixes are landed or being worked on for Lustre 2.10.1. You might try testing the current b2_10 to see if that resolves your problems.
You are right, I might end up doing that. Sorry, but I did not understand
whether 2.10.1 is officially out or still a release candidate.
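In case I do try it, I assume grabbing the maintenance branch would go
roughly like this (just a sketch; the repo URL is the usual upstream tree and
the build steps depend on the kernel/OFED in use):

    # clone the upstream Lustre tree and switch to the 2.10 maintenance branch
    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout b2_10
    # standard build; configure options vary per distro/OFED
    sh autogen.sh && ./configure && make rpms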
thanks
>
> Cheers, Andreas
>
> On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>> Hello.
>>
>> I configured Multi-Rail in my Lustre environment.
>>
>> MDS:    172.21.42.213@tcp
>> OSS:    172.21.52.118@o2ib
>>         172.21.52.86@o2ib
>> Client: 172.21.52.124@o2ib
>>         172.21.52.125@o2ib
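>>
>> For reference, the Multi-Rail peer entries were created with lnetctl roughly
>> along these lines (a sketch from memory; the exact flags may differ between
>> 2.10.x builds):
>>
>>   # on the client: group both OSS NIDs under one Multi-Rail peer
>>   lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
>>   # check what LNet now knows about the peer
>>   lnetctl peer show -v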
>>
>>
>> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>> nid                    refs state  last   max   rtr   min    tx   min queue
>> 172.21.52.124@o2ib        1    NA    -1   128   128   128   128   128     0
>> 172.21.52.125@o2ib        1    NA    -1   128   128   128   128   128     0
>> 172.21.42.213@tcp         1    NA    -1     8     8     8     8     6     0
>>
>> After configuring Multi-Rail I can see both InfiniBand interfaces as peers
>> on the OSS and on the client side.
>> Before Multi-Rail the Lustre client could mount the Lustre FS without
>> problems; now that Multi-Rail is set up, the client can no longer mount the
>> filesystem.
>>
>> When I mount Lustre from the client (fstab entry):
>>
>> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
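>>
>> For what it is worth, the equivalent manual mount would be something like
>> this (noauto is an fstab-only option, so it is dropped here):
>>
>>   mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu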
>>
>> the filesystem cannot be mounted and I get these errors:
>>
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126]  req@ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86@o2ib) was lost; in progress operations using this service will wait for recovery to complete
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>>
>>
>> The mount point appears and disappears from "df" every few seconds.
>>
>> I do not have a clue how to fix this. The Multi-Rail capability is important for me.
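>>
>> If it is useful, this is the kind of LNet-level check I can run from the
>> client against each NID (commands only, no output pasted here):
>>
>>   lctl ping 172.21.52.118@o2ib
>>   lctl ping 172.21.52.86@o2ib
>>   lctl ping 172.21.42.213@tcp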
>>
>> I have Lustre 2.10.0 on both the client side and the server side.
>> Here is my lnet.conf on the Lustre client side; the one on the OSS side is
>> similar, just with the o2ib peers swapped. A sketch of how it is applied
>> follows the listing.
>>
>> net:
>>     - net type: lo
>>       local NI(s):
>>         - nid: 0@lo
>>           status: up
>>           statistics:
>>               send_count: 0
>>               recv_count: 0
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 0
>>               peer_credits: 0
>>               peer_buffer_credits: 0
>>               credits: 0
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: 0
>>           CPT: "[0]"
>>     - net type: o2ib
>>       local NI(s):
>>         - nid: 172.21.52.124@o2ib
>>           status: up
>>           interfaces:
>>               0: ib0
>>           statistics:
>>               send_count: 7
>>               recv_count: 7
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>         - nid: 172.21.52.125@o2ib
>>           status: up
>>           interfaces:
>>               0: ib1
>>           statistics:
>>               send_count: 5
>>               recv_count: 5
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>     - net type: tcp
>>       local NI(s):
>>         - nid: 172.21.42.195@tcp
>>           status: up
>>           interfaces:
>>               0: enp7s0f0
>>           statistics:
>>               send_count: 51
>>               recv_count: 51
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 8
>>               peer_buffer_credits: 0
>>               credits: 256
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>> peer:
>>     - primary nid: 172.21.42.213@tcp
>>       Multi-Rail: False
>>       peer ni:
>>         - nid: 172.21.42.213@tcp
>>           state: NA
>>           max_ni_tx_credits: 8
>>           available_tx_credits: 8
>>           min_tx_credits: 6
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 8
>>           min_rtr_credits: 8
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>     - primary nid: 172.21.52.86@o2ib
>>       Multi-Rail: True
>>       peer ni:
>>         - nid: 172.21.52.86@o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>         - nid: 172.21.52.118@o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
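>>
>> For completeness, the file above gets applied at LNet startup roughly like
>> this (a sketch; I am assuming the usual /etc/lnet.conf path here):
>>
>>   modprobe lnet
>>   lnetctl lnet configure
>>   lnetctl import < /etc/lnet.conf
>>   # quick sanity check after the import
>>   lnetctl net show
>>   lnetctl peer show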
>>
>> Thank you very much for any hint you may give.
>>
>> Rick
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation