[lustre-discuss] lnet configuration messed up when clients mount lustre

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Thu Apr 19 23:17:57 PDT 2018


I figured out the problem was because of a messed up mgs partition on my
MDS.
thanks

On 4/19/18 7:18 PM, Riccardo Veraldi wrote:
> Hello,
> I have on my OSSes and on my clients the lnet configuration is loaded at
> boot time form lnet.conf
> I define local interfaces and peers.
> What happens is that when the lustre filesystems are mounted by the
> clients lnet is modified both on client and OSS side  and tcp peers are
> added at the end
> of the lnet configuration and this has as a consequence that all traffic
> starts to go through TCP and not infiniband.
> I am using RHEL74 and Lustre 2.10.3 my configuration si a bit not common
> because at the same time I use kernel 4.4 on the servers while all the
> clients are stock RHEL74 kernel.
>
> Follows Lnet yaml configuration before client mounting lustre and after
> client mounting lustre partitions.
>
> seems like that auto peer discovering is overriding ib and using just tcp.
> is ther a way to stop peer auto discovery ? or a way to tell that ib has
> precedence over tcp ?
>
> lnet configuread at boot:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0 at lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84 at o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 96252389
>               recv_count: 61558248
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116 at o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 96253070
>               recv_count: 61558217
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207 at tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 380697
>               recv_count: 380352
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159 at tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 0
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 380697
>           recv_count: 380352
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 28134533
>           recv_count: 8553649
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 97
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13505518
>           recv_count: 6106498
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -751
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 17672565
>           recv_count: 13195155
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -369
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13934795
>           recv_count: 11409629
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -458
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 12257935
>           recv_count: 11907534
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -417
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 10748675
>           recv_count: 10384163
>           drop_count: 0
>           refcount: 1
>
> when then clients mount the lustre partitions Lnet is modified:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0 at lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84 at o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116 at o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207 at tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 646
>               recv_count: 646
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159 at tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 6
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 268
>           recv_count: 268
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131 at o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131 at o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.224 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.224 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 101
>           recv_count: 101
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.221 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.221 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.202 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.202 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.223 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.223 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 197
>           recv_count: 197
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.222 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.222 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.201 at tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.201 at tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
> numa:
>     range: 0
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




More information about the lustre-discuss mailing list