[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Fri Sep 29 06:17:07 PDT 2017


On 9/28/17 8:29 PM, Dilger, Andreas wrote:
> Riccardo,
> I'm not an LNet expert, but a number of LNet Multi-Rail fixes have landed or are being worked on for Lustre 2.10.1.  You might try testing the current b2_10 branch to see whether that resolves your problems.
You are right, I might end up doing that. Sorry, but I did not understand
whether 2.10.1 is officially out or if it is still a release candidate.
Thanks
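For reference, a minimal sketch of how one might check out and build the
b2_10 branch for testing (assuming the usual lustre-release tree on
git.whamcloud.com and a host that already has the kernel build
prerequisites installed):

    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout b2_10
    sh autogen.sh
    ./configure         # point --with-linux at the kernel source if autodetection fails
    make rpms           # builds client/server RPMs for the running kernel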
>
> Cheers, Andreas
>
> On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>> Hello.
>>
>> I configured Multi-Rail in my Lustre environment.
>>
>> MDS: 172.21.42.213 at tcp
>> OSS: 172.21.52.118 at o2ib
>>         172.21.52.86 at o2ib
>> Client: 172.21.52.124 at o2ib
>>         172.21.52.125 at o2ib
>>
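>> For reference, a minimal sketch of how peers like these could be grouped as
>> Multi-Rail peers with lnetctl (the NIDs are the ones above; the exact
>> commands actually used may have differed):
>>
>>     # on the client: group both OSS interfaces under one peer
>>     lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
>>     # on the OSS: group both client interfaces under one peer
>>     lnetctl peer add --prim_nid 172.21.52.124@o2ib --nid 172.21.52.125@o2ib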
>>  
>> [root at drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>> nid                      refs state  last   max   rtr   min    tx   min queue
>> 172.21.52.124 at o2ib          1    NA    -1   128   128   128   128   128 0
>> 172.21.52.125 at o2ib          1    NA    -1   128   128   128   128   128 0
>> 172.21.42.213 at tcp           1    NA    -1     8     8     8     8     6 0
>>
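>> The same peer state can also be inspected through lnetctl itself, e.g.:
>>
>>     lnetctl peer show -v
>>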
>> After configuring Multi-Rail I can see both InfiniBand interfaces as peers on the OSS and on the client side.
>> Before Multi-Rail the Lustre client could mount the Lustre FS without problems;
>> now that Multi-Rail is set up the client can no longer mount the filesystem.
>>
>> When I mount lustre from the client (fstab entry):
>>
>> 172.21.42.213 at tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
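>>
>> For comparison, an equivalent manual mount (a sketch with the same options) would be:
>>
>>     mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu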
>>
>> the filesystem cannot be mounted and I get these errors:
>>
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre:
>> 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
>> failed due to network error: [sent 1506562126/real 1506562126]
>> req at ffff8808326b2a00 x1579744801849904/t0(0)
>> o400->drplu-OST0001-osc-ffff88085d134800 at 172.21.52.86@o2ib:28/4
>> lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre:
>> drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at
>> 172.21.52.86 at o2ib) was lost; in progress operations using this service
>> will wait for recovery to complete
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre:
>> drplu-OST0001-osc-ffff88085d134800: Connection restored to
>> 172.21.52.86 at o2ib (at 172.21.52.86 at o2ib)
>>
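>> For what it's worth, a basic LNet-level connectivity check from the client
>> toward the OSS NIDs (assuming lctl is in the path) would be:
>>
>>     lctl ping 172.21.52.86@o2ib
>>     lctl ping 172.21.52.118@o2ib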
>>
>> The mount point appears and disappears from "df" every few seconds.
>>
>> I do not have a clue how to fix this. The Multi-Rail capability is important for me.
>>
>> I have Lustre 2.10.0 on both the client side and the server side.
>> Here is my lnet.conf on the Lustre client side; the one on the OSS side is
>> similar, just with the peers swapped on the o2ib net (an export/import
>> sketch follows the dump).
>>
>> net:
>>     - net type: lo
>>       local NI(s):
>>         - nid: 0 at lo
>>           status: up
>>           statistics:
>>               send_count: 0
>>               recv_count: 0
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 0
>>               peer_credits: 0
>>               peer_buffer_credits: 0
>>               credits: 0
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: 0
>>           CPT: "[0]"
>>     - net type: o2ib
>>       local NI(s):
>>         - nid: 172.21.52.124 at o2ib
>>           status: up
>>           interfaces:
>>               0: ib0
>>           statistics:
>>               send_count: 7
>>               recv_count: 7
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>         - nid: 172.21.52.125 at o2ib
>>           status: up
>>           interfaces:
>>               0: ib1
>>           statistics:
>>               send_count: 5
>>               recv_count: 5
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>     - net type: tcp
>>       local NI(s):
>>         - nid: 172.21.42.195 at tcp
>>           status: up
>>           interfaces:
>>               0: enp7s0f0
>>           statistics:
>>               send_count: 51
>>               recv_count: 51
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 8
>>               peer_buffer_credits: 0
>>               credits: 256
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>> peer:
>>     - primary nid: 172.21.42.213 at tcp
>>       Multi-Rail: False
>>       peer ni:
>>         - nid: 172.21.42.213 at tcp
>>           state: NA
>>           max_ni_tx_credits: 8
>>           available_tx_credits: 8
>>           min_tx_credits: 6
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 8
>>           min_rtr_credits: 8
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>     - primary nid: 172.21.52.86 at o2ib
>>       Multi-Rail: True
>>       peer ni:
>>         - nid: 172.21.52.86 at o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>         - nid: 172.21.52.118 at o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
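>>
>> The YAML above is what lnetctl itself dumps; a minimal sketch of the
>> export/import cycle, assuming the file is kept at /etc/lnet.conf:
>>
>>     # dump the running LNet configuration to a file
>>     lnetctl export > /etc/lnet.conf
>>     # after a reboot: bring LNet up and replay the saved configuration
>>     lnetctl lnet configure
>>     lnetctl import < /etc/lnet.conf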
>>
>> Thank you very much for any hint you can give.
>>
>> Rick
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>


