[Lustre-discuss] problem in lustre lnet routing

Larry tsrjzq at gmail.com
Wed Nov 23 19:38:47 PST 2011


Hi all,

I have a problem setting up LNet routing. The MDS and OSSes have both
IB and GigE networks: 30.9.100.* for IB and 20.9.100.* for GigE. Most
of the clients have IB too, but a few do not, so I chose one client to
act as an LNet router. The configuration is below:

On the MDS and OSSes,

IB: 30.9.100.*
GigE: 20.9.100.*
modprobe.conf: options lnet networks="o2ib0(ib0)" routes="tcp0 30.9.0.5@o2ib0"

On the router,

IB: 30.9.0.5
GigE: 20.9.0.5
modprobe.conf: options lnet networks="o2ib0(ib0),tcp0(eth1)" forwarding="enabled"

On the GigE client,

GigE: 20.9.0.2
modprobe.conf: options lnet networks="tcp0(eth1)" routes="o2ib0 20.9.0.5@tcp0"
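
To make the intent of the three configurations concrete, here is a minimal Python sketch of how a node might choose the first hop for a destination NID: a NID on a network the node is attached to is sent directly, and anything else goes to the gateway listed in routes. This is a simplified model only, not LNet's actual code; real LNet also weighs multiple gateways, hop counts and router health. The helper names parse_nid and next_hop are hypothetical.

```python
# Simplified model of LNet first-hop selection (illustrative only).
# parse_nid and next_hop are hypothetical helpers, not Lustre APIs.

def parse_nid(nid):
    """Split 'addr@net' into (addr, net), normalising 'tcp' to 'tcp0'."""
    addr, net = nid.rsplit("@", 1)
    if not net[-1].isdigit():
        net += "0"  # a network name without a number means network 0
    return addr, net

def next_hop(dest_nid, local_nets, routes):
    """Return ('direct', nid), ('routed', gateway_nid) or ('unreachable', None)."""
    addr, net = parse_nid(dest_nid)
    if net in local_nets:
        return ("direct", f"{addr}@{net}")
    if net in routes:
        return ("routed", routes[net])
    return ("unreachable", None)

# GigE client from the post: networks="tcp0(eth1)", routes="o2ib0 20.9.0.5@tcp0"
client_nets = {"tcp0"}
client_routes = {"o2ib0": "20.9.0.5@tcp0"}

for nid in ("30.9.100.31@o2ib", "20.9.100.31@tcp"):
    print(nid, "->", next_hop(nid, client_nets, client_routes))
```

Under this model, a request addressed to 30.9.100.31@o2ib is handed to the router at 20.9.0.5@tcp0, while a request addressed to 20.9.100.31@tcp would be sent straight out of the client's GigE interface without passing through the router.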


Once LNet is configured, the client can lctl ping the MDS and every OSS.
For example,

client:~ # lctl ping 30.9.100.31@o2ib
12345-0@lo
12345-30.9.100.31@o2ib

where 30.9.100.31 is the MDS.

But mount -t lustre 30.9.100.31@o2ib0:30.9.100.32@o2ib0:/fnfs /mnt
fails. The log says:


Nov 24 10:36:37 cn-fn02 kernel: [502743.285050] Lustre: OBD class driver, http://wiki.whamcloud.com/
Nov 24 10:36:37 cn-fn02 kernel: [502743.285056] Lustre:         Lustre Version: 2.1.0
Nov 24 10:36:37 cn-fn02 kernel: [502743.285060] Lustre:         Build Version: RC2-g9d71fe8-PRISTINE-2.6.32.12-0.7-default
Nov 24 10:36:37 cn-fn02 kernel: [502743.287057] Lustre: Lustre LU module (ffffffffa17f6d00).
Nov 24 10:36:37 cn-fn02 kernel: [502743.358095] Lustre: Added LNI 20.9.0.2@tcp [8/256/0/180]
Nov 24 10:36:37 cn-fn02 kernel: [502743.358153] Lustre: Accept secure, port 988
Nov 24 10:36:37 cn-fn02 kernel: [502743.423409] Lustre: Lustre OSC module (ffffffffa1a9b800).
Nov 24 10:36:37 cn-fn02 kernel: [502743.438668] Lustre: Lustre LOV module (ffffffffa1b09500).
Nov 24 10:36:37 cn-fn02 kernel: [502743.460108] Lustre: Lustre client module (ffffffffa1ba9a40).
Nov 24 10:36:37 cn-fn02 kernel: [502743.480266] Lustre: 4329:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC30.9.100.31@o2ib->MGC30.9.100.31@o2ib_0 netid 20000: select flavor null
Nov 24 10:36:37 cn-fn02 kernel: [502743.485938] Lustre: MGC30.9.100.31@o2ib: Reactivating import
Nov 24 10:36:37 cn-fn02 kernel: [502743.517528] Lustre: 4329:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import fnfs-MDT0000-mdc-ffff8801b79afc00->30.9.100.31@o2ib netid 20000: select flavor null
Nov 24 10:36:42 cn-fn02 kernel: [502748.508709] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321488 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has timed out for sent delay: [sent 1322102197] [real_sent 0] [current 1322102202] [deadline 5s] [delay 0s]  req@ffff88019c603c00 x1386324633321488/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102202 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:37:07 cn-fn02 kernel: [502773.472069] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321491 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102222] [real_sent 1322102222] [current 1322102227] [deadline 5s] [delay 0s]  req@ffff88019b092400 x1386324633321491/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102227 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:37:27 cn-fn02 kernel: [502793.442762] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 5s
Nov 24 10:37:27 cn-fn02 kernel: [502793.442802] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321493 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102247] [real_sent 1322102247] [current 1322102247] [deadline 10s] [delay -10s]  req@ffff8801b68ebc00 x1386324633321493/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102257 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:02 cn-fn02 kernel: [502828.392144] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321495 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102272] [real_sent 1322102272] [current 1322102282] [deadline 10s] [delay 0s]  req@ffff88019c603c00 x1386324633321495/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102282 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:17 cn-fn02 kernel: [502843.369501] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 10s
Nov 24 10:38:17 cn-fn02 kernel: [502843.369561] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321497 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102297] [real_sent 1322102297] [current 1322102297] [deadline 15s] [delay -15s]  req@ffff88019b082000 x1386324633321497/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102312 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:57 cn-fn02 kernel: [502883.322837] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321499 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102322] [real_sent 1322102322] [current 1322102337] [deadline 15s] [delay 0s]  req@ffff8801b7860400 x1386324633321499/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102337 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:07 cn-fn02 kernel: [502893.296214] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 15s
Nov 24 10:39:07 cn-fn02 kernel: [502893.296281] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321501 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102347] [real_sent 1322102347] [current 1322102347] [deadline 20s] [delay -20s]  req@ffff8801b7400400 x1386324633321501/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102367 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:52 cn-fn02 kernel: [502938.234238] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321503 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102372] [real_sent 1322102372] [current 1322102392] [deadline 20s] [delay 0s]  req@ffff88019a9d0000 x1386324633321503/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102392 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:57 cn-fn02 kernel: [502943.222935] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 20s
Nov 24 10:40:01 cn-fn02 /usr/sbin/cron[4509]: (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 -S ALL 1 1)
Nov 24 10:40:47 cn-fn02 kernel: [502993.149647] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321507 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102422] [real_sent 1322102422] [current 1322102447] [deadline 25s] [delay 0s]  req@ffff8801b7583000 x1386324633321507/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102447 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:40:47 cn-fn02 kernel: [502993.149653] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Nov 24 10:41:12 cn-fn02 kernel: [503018.117004] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 25s
Nov 24 10:42:07 cn-fn02 kernel: [503073.041134] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321512 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102497] [real_sent 1322102497] [current 1322102527] [deadline 30s] [delay 0s]  req@ffff88019b08b000 x1386324633321512/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102527 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:42:07 cn-fn02 kernel: [503073.041140] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Nov 24 10:42:27 cn-fn02 kernel: [503093.011078] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 30s

............

I wonder why the client tries to connect to 20.9.100.31@tcp and
30.9.100.32@o2ib rather than 30.9.100.31@o2ib. 30.9.100.31 is my
active MDS; 30.9.100.32 is only a standby.
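
To see at a glance which NIDs the client actually tried, the ptlrpc_expire_one_request() messages can be tallied by target NID and outcome. The following is a rough Python sketch, not a Lustre tool; the summarize function and the regular expression are assumptions based only on the message wording in the log above. The pattern accepts both "@" and the " at " that some mail archives substitute for it, and the sample text is three fragments of that log.

```python
import re
from collections import Counter

# Target NID and outcome in ptlrpc_expire_one_request() messages.
# Accepts '@' as well as the ' at ' some mail archives substitute for it.
PATTERN = re.compile(
    r"to NID (?P<addr>[\d.]+)(?:@| at )(?P<net>\w+) has "
    r"(?P<outcome>timed out for [\w ]+?|failed due to network error):"
)

def summarize(log_text):
    """Count (nid, outcome) pairs; hypothetical helper for illustration."""
    text = " ".join(log_text.split())  # undo mail line wrapping at spaces
    return Counter(
        (f"{m['addr']}@{m['net']}", m["outcome"])
        for m in PATTERN.finditer(text)
    )

# Fragments of the log from this post
SAMPLE = """
to NID 20.9.100.31@tcp has timed out for sent
delay: [sent 1322102197] [real_sent 0] [current 1322102202]
to NID 30.9.100.32@o2ib has timed out for slow
reply: [sent 1322102222] [real_sent 1322102222] [current 1322102227]
to NID 20.9.100.31@tcp has failed due to
network error: [sent 1322102247] [real_sent 1322102247]
"""

for (nid, outcome), count in sorted(summarize(SAMPLE).items()):
    print(f"{nid:20} {outcome:28} {count}")
```

On these fragments the tally contains only 20.9.100.31@tcp and 30.9.100.32@o2ib as request targets; 30.9.100.31@o2ib does not appear among them.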

Thanks a lot!


