[lustre-discuss] lustre client can't mount after configuring LNet with lnetctl

Dilger, Andreas andreas.dilger at intel.com
Sat Sep 30 09:26:17 PDT 2017


On Sep 29, 2017, at 07:17, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
> 
> On 9/28/17 8:29 PM, Dilger, Andreas wrote:
>> Riccardo,
>> I'm not an LNet expert, but a number of LNet multi-rail fixes have landed or are being worked on for Lustre 2.10.1.  You might try testing the current b2_10 branch to see if that resolves your problems.
> You are right, I might end up doing that. Sorry, but I did not understand
> whether 2.10.1 is officially out or if it is still a release candidate.

2.10.1 isn't officially released because of a problem we were hitting with RHEL 7.4 + OFED + DNE, but that has since been fixed.  In any case, the b2_10 branch will only get low-risk changes, and since we are at -RC1 I would expect it to be quite stable, and possibly better than what you are seeing now.
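
If you want to test it, a minimal sketch for building from the b2_10 branch (assuming a standard RPM-based build host with the kernel and e2fsprogs development packages already installed) would be something like:

  git clone git://git.whamcloud.com/fs/lustre-release.git
  cd lustre-release
  git checkout b2_10
  sh autogen.sh
  ./configure
  make rpms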

Conversely, if this _doesn't_ fix your problem, then it would be good to know about it.  I don't think we would hold up 2.10.1 for that fix, but it should go into 2.10.2 if possible.

Cheers, Andreas

>> 
>> On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi at cnaf.infn.it> wrote:
>>> Hello.
>>> 
>>> I configured multi-rail in my Lustre environment.
>>> 
>>> MDS: 172.21.42.213@tcp
>>> OSS: 172.21.52.118@o2ib
>>>        172.21.52.86@o2ib
>>> Client: 172.21.52.124@o2ib
>>>        172.21.52.125@o2ib
>>> 
>>> 
>>> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>>> nid                      refs state  last   max   rtr   min    tx   min queue
>>> 172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128     0
>>> 172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128     0
>>> 172.21.42.213@tcp           1    NA    -1     8     8     8     8     6     0
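>>> 
>>> (The same peer state can also be queried through lnetctl; a sketch, assuming the 2.10 syntax:
>>> 
>>>   lnetctl peer show
>>> 
>>> which prints the peers in the same YAML form used by lnetctl export.)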
>>> 
>>> After configuring multi-rail I can see both InfiniBand interfaces as peers on the OSS and on the client side.
>>> Before multi-rail, the Lustre client could mount the Lustre FS without problems.
>>> Now that multi-rail is set up, the client can no longer mount the filesystem.
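>>> 
>>> (A quick LNet-level check before debugging the mount itself, assuming lctl from 2.10 on the client, is to ping the server NIDs directly:
>>> 
>>>   lctl ping 172.21.42.213@tcp
>>>   lctl ping 172.21.52.118@o2ib
>>>   lctl ping 172.21.52.86@o2ib
>>> 
>>> If the o2ib pings fail while the tcp ping succeeds, the problem is at the LNet layer rather than in the filesystem.)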
>>> 
>>> When I mount lustre from the client (fstab entry):
>>> 
>>> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock 0 0
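>>> 
>>> (The equivalent manual mount, for testing outside of fstab, would be something like:
>>> 
>>>   mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu
>>> )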
>>> 
>>> the filesystem cannot be mounted, and I get these errors:
>>> 
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre:
>>> 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
>>> failed due to network error: [sent 1506562126/real 1506562126]
>>> req@ffff8808326b2a00 x1579744801849904/t0(0)
>>> o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4
>>> lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre:
>>> drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at
>>> 172.21.52.86@o2ib) was lost; in progress operations using this service
>>> will wait for recovery to complete
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre:
>>> drplu-OST0001-osc-ffff88085d134800: Connection restored to
>>> 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>>> 
>>> 
>>> the mount point appears in and disappears from the "df" output every few seconds
>>> 
>>> I do not have a clue how to fix this. The multi-rail capability is important for me.
>>> 
>>> I have Lustre 2.10.0 on both the client and the server side.
>>> Here is my lnet.conf on the Lustre client side. The one on the OSS side is
>>> similar, just with the peers swapped for the o2ib net.
>>> 
>>> net:
>>>    - net type: lo
>>>      local NI(s):
>>>        - nid: 0@lo
>>>          status: up
>>>          statistics:
>>>              send_count: 0
>>>              recv_count: 0
>>>              drop_count: 0
>>>          tunables:
>>>              peer_timeout: 0
>>>              peer_credits: 0
>>>              peer_buffer_credits: 0
>>>              credits: 0
>>>          lnd tunables:
>>>          tcp bonding: 0
>>>          dev cpt: 0
>>>          CPT: "[0]"
>>>    - net type: o2ib
>>>      local NI(s):
>>>        - nid: 172.21.52.124@o2ib
>>>          status: up
>>>          interfaces:
>>>              0: ib0
>>>          statistics:
>>>              send_count: 7
>>>              recv_count: 7
>>>              drop_count: 0
>>>          tunables:
>>>              peer_timeout: 180
>>>              peer_credits: 128
>>>              peer_buffer_credits: 0
>>>              credits: 1024
>>>          lnd tunables:
>>>              peercredits_hiw: 64
>>>              map_on_demand: 32
>>>              concurrent_sends: 256
>>>              fmr_pool_size: 2048
>>>              fmr_flush_trigger: 512
>>>              fmr_cache: 1
>>>              ntx: 2048
>>>              conns_per_peer: 4
>>>          tcp bonding: 0
>>>          dev cpt: -1
>>>          CPT: "[0]"
>>>        - nid: 172.21.52.125@o2ib
>>>          status: up
>>>          interfaces:
>>>              0: ib1
>>>          statistics:
>>>              send_count: 5
>>>              recv_count: 5
>>>              drop_count: 0
>>>          tunables:
>>>              peer_timeout: 180
>>>              peer_credits: 128
>>>              peer_buffer_credits: 0
>>>              credits: 1024
>>>          lnd tunables:
>>>              peercredits_hiw: 64
>>>              map_on_demand: 32
>>>              concurrent_sends: 256
>>>              fmr_pool_size: 2048
>>>              fmr_flush_trigger: 512
>>>              fmr_cache: 1
>>>              ntx: 2048
>>>              conns_per_peer: 4
>>>          tcp bonding: 0
>>>          dev cpt: -1
>>>          CPT: "[0]"
>>>    - net type: tcp
>>>      local NI(s):
>>>        - nid: 172.21.42.195@tcp
>>>          status: up
>>>          interfaces:
>>>              0: enp7s0f0
>>>          statistics:
>>>              send_count: 51
>>>              recv_count: 51
>>>              drop_count: 0
>>>          tunables:
>>>              peer_timeout: 180
>>>              peer_credits: 8
>>>              peer_buffer_credits: 0
>>>              credits: 256
>>>          lnd tunables:
>>>          tcp bonding: 0
>>>          dev cpt: -1
>>>          CPT: "[0]"
>>> peer:
>>>    - primary nid: 172.21.42.213@tcp
>>>      Multi-Rail: False
>>>      peer ni:
>>>        - nid: 172.21.42.213@tcp
>>>          state: NA
>>>          max_ni_tx_credits: 8
>>>          available_tx_credits: 8
>>>          min_tx_credits: 6
>>>          tx_q_num_of_buf: 0
>>>          available_rtr_credits: 8
>>>          min_rtr_credits: 8
>>>          send_count: 0
>>>          recv_count: 0
>>>          drop_count: 0
>>>          refcount: 1
>>>    - primary nid: 172.21.52.86@o2ib
>>>      Multi-Rail: True
>>>      peer ni:
>>>        - nid: 172.21.52.86@o2ib
>>>          state: NA
>>>          max_ni_tx_credits: 128
>>>          available_tx_credits: 128
>>>          min_tx_credits: 128
>>>          tx_q_num_of_buf: 0
>>>          available_rtr_credits: 128
>>>          min_rtr_credits: 128
>>>          send_count: 0
>>>          recv_count: 0
>>>          drop_count: 0
>>>          refcount: 1
>>>        - nid: 172.21.52.118@o2ib
>>>          state: NA
>>>          max_ni_tx_credits: 128
>>>          available_tx_credits: 128
>>>          min_tx_credits: 128
>>>          tx_q_num_of_buf: 0
>>>          available_rtr_credits: 128
>>>          min_rtr_credits: 128
>>>          send_count: 0
>>>          recv_count: 0
>>>          drop_count: 0
>>>          refcount: 1
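>>> 
>>> For reference, a sketch of the lnetctl commands that build a configuration like the above (assuming the 2.10 multi-rail syntax; the interface names ib0/ib1 and the NIDs are the ones from my setup, and tunables are set separately):
>>> 
>>>   lnetctl lnet configure
>>>   lnetctl net add --net o2ib --if ib0
>>>   lnetctl net add --net o2ib --if ib1
>>>   lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
>>>   lnetctl export > /etc/lnet.conf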
>>> 
>>> Thank you very much for any hints you can give.
>>> 
>>> Rick

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation