[lustre-discuss] lnet configuration messed up when clients mount lustre
Riccardo Veraldi
Riccardo.Veraldi at cnaf.infn.it
Thu Apr 19 23:17:57 PDT 2018
I figured out that the problem was caused by a messed-up MGS partition on
my MDS.
Thanks.
On 4/19/18 7:18 PM, Riccardo Veraldi wrote:
> Hello,
> On my OSSes and on my clients, the LNet configuration is loaded at
> boot time from lnet.conf.
> I define local interfaces and peers.
> What happens is that when the clients mount the Lustre filesystems,
> the LNet configuration is modified on both the client and the OSS side:
> tcp peers are added at the end of the LNet configuration, and as a
> consequence all traffic starts to go through TCP and not InfiniBand.
> I am using RHEL 7.4 and Lustre 2.10.3. My configuration is a bit unusual
> because I use kernel 4.4 on the servers, while all the
> clients run the stock RHEL 7.4 kernel.
>
> Below is the LNet YAML configuration before and after the clients
> mount the Lustre partitions.
>
> It seems that peer auto-discovery is overriding ib and using just tcp.
> Is there a way to stop peer auto-discovery, or a way to specify that ib
> takes precedence over tcp?
>
> LNet configuration at boot:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84@o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 96252389
>               recv_count: 61558248
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116@o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 96253070
>               recv_count: 61558217
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207@tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 380697
>               recv_count: 380352
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159@tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 0
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 380697
>           recv_count: 380352
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 28134533
>           recv_count: 8553649
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 97
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13505518
>           recv_count: 6106498
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -751
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 17672565
>           recv_count: 13195155
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -369
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13934795
>           recv_count: 11409629
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -458
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 12257935
>           recv_count: 11907534
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -417
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 10748675
>           recv_count: 10384163
>           drop_count: 0
>           refcount: 1
>
> When the clients then mount the Lustre partitions, the LNet configuration is modified:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84@o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116@o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207@tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 646
>               recv_count: 646
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159@tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 6
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 268
>           recv_count: 268
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.224@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.224@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 101
>           recv_count: 101
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.221@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.221@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.202@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.202@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.223@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.223@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 197
>           recv_count: 197
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.222@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.222@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.201@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.201@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
> numa:
>     range: 0
>
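The difference between the two dumps above is easier to see when the peer lists are compared mechanically. The following is only a minimal sketch, assuming both dumps have been saved as plain text; the function names `primary_nids` and `added_peers` are hypothetical helpers, not part of any Lustre tool. It accepts NIDs written either as `addr@net` or in the archive form `addr at net`, and ignores leading `> ` quote markers.

```python
import re

# Matches "primary nid: 172.21.42.159@tcp" as well as the archive form
# "primary nid: 172.21.42.159 at tcp". The pattern is searched anywhere in
# the text, so leading "> " quote markers and indentation do not matter.
PRIMARY_NID = re.compile(r"primary nid:\s*(\S+?)(?:@|\s+at\s+)(\S+)")


def primary_nids(dump):
    """Return the set of primary NIDs, normalised to 'addr@net'."""
    return {addr + "@" + net for addr, net in PRIMARY_NID.findall(dump)}


def added_peers(before, after):
    """Group peers that appear only in `after` by LNet network name."""
    grouped = {}
    for nid in sorted(primary_nids(after) - primary_nids(before)):
        net = nid.split("@", 1)[1]
        grouped.setdefault(net, []).append(nid)
    return grouped


if __name__ == "__main__":
    # Abbreviated excerpts of the two dumps in this message.
    before = """
    > - primary nid: 172.21.42.159 at tcp
    > - primary nid: 172.21.52.126 at o2ib
    """
    after = before + """
    > - primary nid: 172.21.42.224 at tcp
    > - primary nid: 172.21.42.221 at tcp
    """
    print(added_peers(before, after))
```

Run against the full dumps, this kind of comparison should report only the six additional tcp peers (172.21.42.201, .202, .221, .222, .223 and .224) and no new o2ib peers, which matches what is described in the message.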
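The claim that traffic moves from InfiniBand to TCP can be checked against the `send_count` values of the local NIs in each dump. The sketch below is an illustration under stated assumptions: `sends_per_net` is a hypothetical helper, it reads a saved text dump rather than calling any Lustre tool, and it only looks at the `net:` section (it stops at the `peer:` line). Note that the counters in the second dump appear to have been reset, so a zero there only means no traffic since the reset.

```python
def sends_per_net(dump):
    """Sum the send_count of all local NIs per 'net type' in a text dump."""
    totals = {}
    net = None
    for raw in dump.splitlines():
        # Drop indentation and any leading ">" quote marker.
        line = raw.strip().lstrip(">").strip()
        if line == "peer:":
            break  # per-peer counters are not part of the local NI totals
        if line.startswith("- net type:"):
            net = line.split(":", 1)[1].strip()
            totals[net] = 0
        elif line.startswith("send_count:") and net is not None:
            totals[net] += int(line.split(":", 1)[1])
    return totals


if __name__ == "__main__":
    # Abbreviated excerpt of the boot-time dump in this message.
    boot = """
    > net:
    > - net type: o2ib
    > send_count: 96252389
    > send_count: 96253070
    > - net type: tcp
    > send_count: 380697
    > peer:
    > send_count: 380697
    """
    print(sends_per_net(boot))
```

With the figures from this message, the boot-time dump gives 192505459 sends on o2ib against 380697 on tcp (roughly 0.2% on tcp), while the dump taken after the clients mount shows 0 on o2ib and 646 on tcp.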