[lustre-discuss] lustre client can't mount after configuring LNET with lnetctl
Riccardo Veraldi
Riccardo.Veraldi at cnaf.infn.it
Wed Sep 27 20:22:13 PDT 2017
Hello.
I configured Multi-Rail in my Lustre environment.

MDS:    172.21.42.213@tcp
OSS:    172.21.52.118@o2ib
        172.21.52.86@o2ib
Client: 172.21.52.124@o2ib
        172.21.52.125@o2ib
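For reference, a minimal sketch of the lnetctl steps used for this kind of Multi-Rail setup (interface names ib0/ib1 and NIDs are taken from the listing above; this is not a verbatim transcript of my session):

# load LNet and enable dynamic configuration
modprobe lnet
lnetctl lnet configure

# add both IB interfaces as local NIs on the o2ib net (Multi-Rail)
lnetctl net add --net o2ib --if ib0,ib1

# declare the OSS as a Multi-Rail peer with both of its NIDs
lnetctl peer add --prim_nid 172.21.52.118@o2ib --nid 172.21.52.86@o2ib

# save the running configuration so it can be reloaded at boot
lnetctl export > /etc/lnet.conf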
[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                      refs state  last  max  rtr  min   tx  min queue
172.21.52.124@o2ib          1    NA    -1  128  128  128  128  128     0
172.21.52.125@o2ib          1    NA    -1  128  128  128  128  128     0
172.21.42.213@tcp           1    NA    -1    8    8    8    8    6     0
After configuring Multi-Rail I can see both InfiniBand interface peers on the OSS side and on the client side.
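The same information can also be checked with lnetctl (a quick sketch; both commands are standard in 2.10):

# show the configured local NIs with statistics and tunables
lnetctl net show -v

# show the peer table, including the Multi-Rail flag and per-NI credits
lnetctl peer show -v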
Before Multi-Rail, the Lustre client could mount the Lustre FS without problems. Now that Multi-Rail is set up, the client cannot mount the filesystem anymore.
When I mount Lustre from the client (fstab entry):

172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock 0 0

the filesystem cannot be mounted and I get these errors:
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842861] Lustre: 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1506562126/real 1506562126] req@ffff8808326b2a00 x1579744801849904/t0(0) o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842872] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at 172.21.52.86@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.843306] Lustre: drplu-OST0001-osc-ffff88085d134800: Connection restored to 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
The mount point appears and disappears from "df" every few seconds. I do not have a clue how to fix this; the Multi-Rail capability is important for me.
I have Lustre 2.10.0 on both the client side and the server side.
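For reference, these are the manual checks I can run from the client (a minimal sketch using the NIDs above; just lctl ping and a by-hand mount, nothing exotic):

# verify basic LNet reachability to each server NID
lctl ping 172.21.42.213@tcp
lctl ping 172.21.52.118@o2ib
lctl ping 172.21.52.86@o2ib

# try the mount by hand, bypassing fstab
mount -t lustre -o flock,lazystatfs 172.21.42.213@tcp:/drplu /drplu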
Here is my lnet.conf on the Lustre client side. The one on the OSS side is similar, just with the peers swapped for the o2ib net:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 7
              recv_count: 7
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 5
              recv_count: 5
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 51
              recv_count: 51
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
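The file above is applied at startup; a sketch of the manual equivalent (I am assuming the stock lnet service does essentially the same import; adjust the path if your setup differs):

# load LNet, enable dynamic configuration, and apply the saved YAML
modprobe lnet
lnetctl lnet configure
lnetctl import < /etc/lnet.conf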
Thank you very much for any hints you may give.

Rick