[lustre-discuss] Lustre 2.10.1 error while mounting multi-rail
Riccardo Veraldi
Riccardo.Veraldi at cnaf.infn.it
Mon Oct 9 16:44:00 PDT 2017
Hello.
Here I am again, trying to get multi-rail to work.
I configured multi-rail on both the OSS and the client side.
I have one OSS, one MDS and one client, all running RHEL 7.4 and Lustre 2.10.1:
* psdrp-tst-mds10 MDS
* drp-tst-oss10 OSS (172.21.52.86@o2ib, 172.21.52.118@o2ib)
* drp-tst-lu10 Lustre client (172.21.52.124@o2ib, 172.21.52.125@o2ib)
Without multi-rail everything works fine.
What I am doing is aggregating two IB interfaces to get more
performance. However, when I mount the Lustre filesystem from the
client I get the errors below and the filesystem does not mount:
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.914832] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib
rejected: consumer defined fatal error
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.917290] Lustre: Mounted
drplu-client
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.920832] Lustre:
31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1507591430/real 1507591430]
req@ffff8807f56a0300 x1580812428378832/t0(0)
o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens
520/544 e 0 to 1 dl 1507591435 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct 9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError:
673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:57 drp-tst-lu10 kernel: [248184.645463] LustreError:
674:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:58 drp-tst-lu10 kernel: [248186.117364] LustreError:
678:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:58 drp-tst-lu10 kernel: [248186.117411] LustreError:
678:0:(llite_lib.c:1748:ll_statfs_internal()) Skipped 1 previous similar
message
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912554] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib
rejected: consumer defined fatal error
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912610] LNetError:
1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) Skipped 3 previous similar
messages
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.918903] Lustre:
31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1507591455/real 1507591455]
req@ffff88075d2ee700 x1580812428378960/t0(0)
o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens
520/544 e 0 to 1 dl 1507591465 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
My fstab entry:
172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock 0 0
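(That fstab entry is equivalent to mounting by hand with something like
the following; the MGS NID, fs name and mount point are the ones above:)

mount -t lustre -o lazystatfs,flock 172.21.42.213@tcp:/drplu /drplu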
I can see the peers in the LNet status:
[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                 refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib     1    NA    -1   128   128   128   128   128     0
172.21.52.125@o2ib     1    NA    -1   128   128   128   128   128     0
172.21.42.213@tcp      1    NA    -1     8     8     8     8     6     0

[root@drp-tst-lu10:etc]# cat /proc/sys/lnet/peers
nid                 refs state  last   max   rtr   min    tx   min queue
172.21.52.118@o2ib     1    NA    -1   128   128   128   128   127     0
172.21.52.86@o2ib      1    NA    -1   128   128   128   128   102     0
172.21.42.213@tcp      1    NA    -1     8     8     8     8     6     0
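For reference, the multi-rail peer entries were created with commands
along these lines (a sketch from memory, so the exact lnetctl flags may
differ):

# on the client: the OSS primary NID plus its second rail
lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
# on the OSS: the client primary NID plus its second rail
lnetctl peer add --prim_nid 172.21.52.124@o2ib --nid 172.21.52.125@o2ib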
Here is my LNet configuration with multi-rail on the OSS side:
[root@drp-tst-oss10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.86@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
        - nid: 172.21.52.118@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.211@tcp
          status: up
          interfaces:
              0: enp1s0f0
          statistics:
              send_count: 198
              recv_count: 198
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 198
          recv_count: 198
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.124@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.124@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.125@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0
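For what it's worth, the two o2ib NIs above were brought up with
something like this (again a sketch, quoting the lnetctl syntax from
memory; the lnd tunables come from my ko2iblnd module options):

lnetctl net add --net o2ib --if ib0,ib1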
And here is the LNet configuration on the client side:
[root@drp-tst-lu10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 403742
              recv_count: 807391
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 99
              recv_count: 99
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 99
          recv_count: 99
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 102
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 403742
          recv_count: 807391
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0
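(On both nodes the configuration is saved and restored across reboots
roughly like this; I believe this is what the lnet systemd service does
with /etc/lnet.conf, but the exact path may differ on your setup:)

lnetctl export > /etc/lnet.conf    # dump the running LNet config as YAML
lnetctl import < /etc/lnet.conf    # load it back (run by the lnet service at boot)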
Anyway, Lustre does not work, which is really weird; as far as I can
tell from the configuration above, it should.
Any hints?
thank you
Rick