[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl
Dilger, Andreas
andreas.dilger@intel.com
Sat Sep 30 09:26:17 PDT 2017
On Sep 29, 2017, at 07:17, Riccardo Veraldi <Riccardo.Veraldi@cnaf.infn.it> wrote:
>
> On 9/28/17 8:29 PM, Dilger, Andreas wrote:
>> Riccardo,
>> I'm not an LNet expert, but a number of LNet multi-rail fixes are landed or being worked on for Lustre 2.10.1. You might try testing the current b2_10 to see if that resolves your problems.
> You are right, I might end up doing that. Sorry, but I did not understand
> whether 2.10.1 is officially out or still a release candidate.
2.10.1 isn't officially released because of a problem we were hitting with RHEL 7.4 + OFED + DNE, but that has since been fixed. In any case, the b2_10 branch will only get low-risk changes, and since we are at -RC1 I would expect it to be quite stable, and possibly better than what you are seeing now.
Conversely, if this _doesn't_ fix your problem, then it would be good to know about it. We wouldn't hold up 2.10.1 for the fix, I think, but it should go into 2.10.2 if possible.
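If you want to give it a try, a minimal sketch of pulling and building b2_10 (assuming the standard Whamcloud git tree, with a matching kernel-devel and OFED stack already installed) would be something like:

    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout b2_10
    sh ./autogen.sh
    ./configure          # add e.g. --with-o2ib=<path> for an external OFED
    make rpms

Adjust the configure options to match your site setup, of course.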
Cheers, Andreas
>>
>> On Sep 27, 2017, at 21:22, Riccardo Veraldi <Riccardo.Veraldi@cnaf.infn.it> wrote:
>>> Hello.
>>>
>>> I configured Multi-Rail in my Lustre environment.
>>>
>>> MDS:    172.21.42.213@tcp
>>> OSS:    172.21.52.118@o2ib
>>>         172.21.52.86@o2ib
>>> Client: 172.21.52.124@o2ib
>>>         172.21.52.125@o2ib
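>>>
>>> (As a first sanity check, basic LNet reachability of each NID can be
>>> verified with lctl ping; for example, from the client:
>>>
>>>     lctl ping 172.21.52.118@o2ib
>>>     lctl ping 172.21.52.86@o2ib
>>>     lctl ping 172.21.42.213@tcp
>>>
>>> each of which should print the remote node's NIDs if the path is healthy.)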
>>>
>>>
>>> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>>> nid                      refs state  last   max   rtr   min    tx   min queue
>>> 172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128     0
>>> 172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128     0
>>> 172.21.42.213@tcp           1    NA    -1     8     8     8     8     6     0
>>>
>>> After configuring Multi-Rail, I can see the peers for both InfiniBand interfaces on the OSS and on the client side.
>>> Before Multi-Rail, the Lustre client could mount the Lustre FS without problems.
>>> Now that Multi-Rail is set up, the client can no longer mount the filesystem.
>>>
>>> When I mount lustre from the client (fstab entry):
>>>
>>> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
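>>>
>>> which corresponds to mounting by hand with something like:
>>>
>>>     mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu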
>>>
>>> the filesystem cannot be mounted, and I get these errors:
>>>
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126] req@ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86@o2ib) was lost; in progress operations using this service will wait for recovery to complete
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>>>
>>>
>>> the mount point appears and disappears from "df" every few seconds.
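>>>
>>> (While this is cycling, the client-side import state can be watched with,
>>> for example:
>>>
>>>     lctl get_param osc.drplu-OST*.import
>>>
>>> which shows the connect/disconnect state of each OSC.)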
>>>
>>> I do not have a clue how to fix this. The Multi-Rail capability is important for me.
>>>
>>> I have Lustre 2.10.0 on both the client side and the server side.
>>> Here is my lnet.conf on the Lustre client side; the one on the OSS side is
>>> similar, just with the o2ib peers swapped.
>>>
>>> net:
>>>   - net type: lo
>>>     local NI(s):
>>>       - nid: 0@lo
>>>         status: up
>>>         statistics:
>>>           send_count: 0
>>>           recv_count: 0
>>>           drop_count: 0
>>>         tunables:
>>>           peer_timeout: 0
>>>           peer_credits: 0
>>>           peer_buffer_credits: 0
>>>           credits: 0
>>>         lnd tunables:
>>>         tcp bonding: 0
>>>         dev cpt: 0
>>>         CPT: "[0]"
>>>   - net type: o2ib
>>>     local NI(s):
>>>       - nid: 172.21.52.124@o2ib
>>>         status: up
>>>         interfaces:
>>>           0: ib0
>>>         statistics:
>>>           send_count: 7
>>>           recv_count: 7
>>>           drop_count: 0
>>>         tunables:
>>>           peer_timeout: 180
>>>           peer_credits: 128
>>>           peer_buffer_credits: 0
>>>           credits: 1024
>>>         lnd tunables:
>>>           peercredits_hiw: 64
>>>           map_on_demand: 32
>>>           concurrent_sends: 256
>>>           fmr_pool_size: 2048
>>>           fmr_flush_trigger: 512
>>>           fmr_cache: 1
>>>           ntx: 2048
>>>           conns_per_peer: 4
>>>         tcp bonding: 0
>>>         dev cpt: -1
>>>         CPT: "[0]"
>>>       - nid: 172.21.52.125@o2ib
>>>         status: up
>>>         interfaces:
>>>           0: ib1
>>>         statistics:
>>>           send_count: 5
>>>           recv_count: 5
>>>           drop_count: 0
>>>         tunables:
>>>           peer_timeout: 180
>>>           peer_credits: 128
>>>           peer_buffer_credits: 0
>>>           credits: 1024
>>>         lnd tunables:
>>>           peercredits_hiw: 64
>>>           map_on_demand: 32
>>>           concurrent_sends: 256
>>>           fmr_pool_size: 2048
>>>           fmr_flush_trigger: 512
>>>           fmr_cache: 1
>>>           ntx: 2048
>>>           conns_per_peer: 4
>>>         tcp bonding: 0
>>>         dev cpt: -1
>>>         CPT: "[0]"
>>>   - net type: tcp
>>>     local NI(s):
>>>       - nid: 172.21.42.195@tcp
>>>         status: up
>>>         interfaces:
>>>           0: enp7s0f0
>>>         statistics:
>>>           send_count: 51
>>>           recv_count: 51
>>>           drop_count: 0
>>>         tunables:
>>>           peer_timeout: 180
>>>           peer_credits: 8
>>>           peer_buffer_credits: 0
>>>           credits: 256
>>>         lnd tunables:
>>>         tcp bonding: 0
>>>         dev cpt: -1
>>>         CPT: "[0]"
>>> peer:
>>>   - primary nid: 172.21.42.213@tcp
>>>     Multi-Rail: False
>>>     peer ni:
>>>       - nid: 172.21.42.213@tcp
>>>         state: NA
>>>         max_ni_tx_credits: 8
>>>         available_tx_credits: 8
>>>         min_tx_credits: 6
>>>         tx_q_num_of_buf: 0
>>>         available_rtr_credits: 8
>>>         min_rtr_credits: 8
>>>         send_count: 0
>>>         recv_count: 0
>>>         drop_count: 0
>>>         refcount: 1
>>>   - primary nid: 172.21.52.86@o2ib
>>>     Multi-Rail: True
>>>     peer ni:
>>>       - nid: 172.21.52.86@o2ib
>>>         state: NA
>>>         max_ni_tx_credits: 128
>>>         available_tx_credits: 128
>>>         min_tx_credits: 128
>>>         tx_q_num_of_buf: 0
>>>         available_rtr_credits: 128
>>>         min_rtr_credits: 128
>>>         send_count: 0
>>>         recv_count: 0
>>>         drop_count: 0
>>>         refcount: 1
>>>       - nid: 172.21.52.118@o2ib
>>>         state: NA
>>>         max_ni_tx_credits: 128
>>>         available_tx_credits: 128
>>>         min_tx_credits: 128
>>>         tx_q_num_of_buf: 0
>>>         available_rtr_credits: 128
>>>         min_rtr_credits: 128
>>>         send_count: 0
>>>         recv_count: 0
>>>         drop_count: 0
>>>         refcount: 1
>>>
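>>> For context, the Multi-Rail peer entry above is the kind of entry created
>>> with, e.g.:
>>>
>>>     lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
>>>
>>> and the whole configuration is saved and replayed across reboots with:
>>>
>>>     lnetctl export > /etc/lnet.conf
>>>     lnetctl import < /etc/lnet.conf
>>>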
>>> Thank you very much for any hint you may give.
>>>
>>> Rick
>>>
>>>
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Intel Corporation
>
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation