<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Hello.</p>
<p>I have configured Multi-Rail on my Lustre environment.</p>
<pre wrap="">MDS:    172.21.42.213@tcp
OSS:    172.21.52.118@o2ib
        172.21.52.86@o2ib
Client: 172.21.52.124@o2ib
        172.21.52.125@o2ib
[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0

After configuring Multi-Rail I can see both InfiniBand interfaces as peers on the OSS and on the client side.
Before Multi-Rail was set up, the Lustre client could mount the Lustre file system without problems.
Now that Multi-Rail is set up, the client can no longer mount the file system.
When I mount Lustre from the client (fstab entry):
172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
the file system cannot be mounted and I get these errors:
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842861] Lustre:
2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1506562126/real 1506562126]
req@ffff8808326b2a00 x1579744801849904/t0(0)
o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4 lens
224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.842872] Lustre:
drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at
172.21.52.86@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
Sep 27 18:28:46 drp-tst-lu10 kernel: [ 596.843306] Lustre:
drplu-OST0001-osc-ffff88085d134800: Connection restored to
172.21.52.86@o2ib (at 172.21.52.86@o2ib)

The mount point appears in and disappears from the "df" output every few seconds.
I do not have a clue how to fix this. The Multi-Rail capability is important for me.
I have Lustre 2.10.0 on both the client side and the server side.
Here is my lnet.conf on the Lustre client side. The one on the OSS side is
similar, with just the peers swapped for the o2ib net.

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 7
              recv_count: 7
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 5
              recv_count: 5
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 51
              recv_count: 51
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1

Thank you very much for any hints you can give.
Rick
</pre>
</body>
</html>