[lustre-discuss] difficulties mounting client via an lnet router
Jessica Otey
jotey at nrao.edu
Mon Jul 11 07:34:36 PDT 2016
All,
I am, as before, working on a small test lustre setup (RHEL 6.8, lustre
v. 2.4.3) to prepare for upgrading a 1.8.9 lustre production system to
2.4.3 (first the servers and lnet routers, then at a subsequent time,
the clients). Lustre servers have IB connections, but the clients are 1G
ethernet only.
For the life of me, I cannot get the client to mount via the router on
this test system. (Client will mount fine when router is taken out of
the equation.) This is the error I am seeing in the syslog from the
mount attempt:
Jul 11 10:15:37 tlclient kernel: Lustre:
3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1468246532/real 1468246532]
req@ffff88032a3f9400 x1539566484848752/t0(0)
o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens
400/544 e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:07 tlclient kernel: Lustre:
3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1468246557/real 1468246557]
req@ffff880629819000 x1539566484848764/t0(0)
o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens
400/544 e 0 to 1 dl 1468246567 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:37 tlclient kernel: Lustre:
3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1468246582/real 1468246582]
req@ffff88062a371000 x1539566484848772/t0(0)
o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens
400/544 e 0 to 1 dl 1468246597 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:44 tlclient kernel: LustreError:
2511:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0,
lovrc=1
Jul 11 10:16:44 tlclient kernel: Lustre: Unmounted tlustre-client
Jul 11 10:16:44 tlclient kernel: LustreError:
4881:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-4)
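When reading timeout lines like the ones above, it can help to pull out the NID each request was addressed to, so it can be compared against the NIDs in the configs. The following is only a minimal sketch: target_nid is a hypothetical helper name, and it assumes the raw syslog line (unwrapped, with its @ separators intact) arrives on stdin.

```shell
#!/bin/sh
# Hypothetical helper: print the NID a ptlrpc request was addressed to.
# Expects unwrapped syslog lines on stdin and looks for the
# "o<opcode>-><target>@<nid>:<portals>" field of the message.
target_nid() {
    sed -n 's/.*o[0-9][0-9]*->[^@]*@\([^:]*\):.*/\1/p'
}
```

Something like "grep ptlrpc_expire_one_request /var/log/messages | target_nid | sort -u" would then list the distinct target NIDs (the log path is an assumption).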
More than one pair of eyes has looked at the configs and confirmed they
look okay. But frankly we've got to be missing something since this
should (like lustre on a good day) 'just work'.
If anyone has seen this issue before and could give some advice, it'd be
appreciated. One major question I have is whether the problem is a
configuration issue or a procedure issue--perhaps the order in which I
am doing things is causing the failure? The order I'm following
currently is:
1) unmount/remove modules on all boxes
2) bring up the lnet modules on the router, and bring up the network
3) On the mds: add the modules, bring up the network, mount the mdt
4) On the oss: add the modules, bring up the network, mount the ost
5) On the client: add the modules, bring up the network, attempt to
mount client (fails)
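For reference, the order above can be written down roughly as the following dry-run sketch. It only prints the commands unless LNET_SKETCH_RUN=1 is set. The function names and the MDT_DEV, MDT_MNT, OST_DEV and OST_MNT placeholders are hypothetical and do not come from the real setup; the client mount line is taken from the fstab entry in the configs. The lctl ping of the router is an extra check, not part of the list.

```shell
#!/bin/sh
# Dry-run sketch of the bring-up order; run each role on its own box.
# Commands are printed by default; set LNET_SKETCH_RUN=1 to execute them.
run() {
    if [ "${LNET_SKETCH_RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Step 1: unmount ($1 = mount point on this box) and remove the modules.
teardown() {
    run umount "$1"
    run lustre_rmmod
}

# Load LNet, bring the network up, and show the local NIDs.
lnet_up() {
    run modprobe lnet
    run lctl network up
    run lctl list_nids
}

# Steps 2-5: the role name selects what gets mounted, if anything.
bringup() {
    case "$1" in
        router)
            lnet_up ;;
        mds)
            lnet_up
            # Placeholder device and mount point (assumptions).
            run mount -t lustre "${MDT_DEV:-/dev/MDT_DEV}" "${MDT_MNT:-/mnt/mdt}" ;;
        oss)
            lnet_up
            # Placeholder device and mount point (assumptions).
            run mount -t lustre "${OST_DEV:-/dev/OST_DEV}" "${OST_MNT:-/mnt/ost}" ;;
        client)
            lnet_up
            # Extra check, not in the list: is the router reachable?
            run lctl ping 10.7.29.134@tcp0
            run mount -t lustre 10.7.129.130@o2ib0:/tlustre /testlustre ;;
        *)
            echo "usage: bringup router|mds|oss|client" >&2
            return 1 ;;
    esac
}
```

After sourcing the file, "bringup client" prints the client-side sequence, and "teardown /testlustre" prints the step 1 commands for the client.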
Configs follow below.
Thanks in advance,
Jessica
tlnet (the router)
[root@tlnet ~]# cat /etc/modprobe.d/lustre.conf
# tlnet configuration
alias ib0 ib_ipoib
alias net-pf-27 ib_sdp
options lnet networks="o2ib0(ib0),tcp0(em1)" forwarding="enabled"
[root@tlnet ~]# ifconfig #lo omitted
em1 Link encap:Ethernet HWaddr 78:2B:CB:25:A7:E2
inet addr:10.7.29.134 Bcast:10.7.29.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:453441 errors:0 dropped:0 overruns:0 frame:0
TX packets:264313 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:436188202 (415.9 MiB) TX bytes:22274957 (21.2 MiB)
ib0 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.7.129.134 Bcast:10.7.129.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:650 errors:0 dropped:0 overruns:0 frame:0
TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:75376 (73.6 KiB) TX bytes:2904 (2.8 KiB)
tlclient (the client)
[root@tlclient ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" live_router_check_interval=60 dead_router_check_interval=60
[root@tlclient ~]# ifconfig #lo omitted
em1 Link encap:Ethernet HWaddr 00:26:B9:35:B1:1A
inet addr:10.7.29.132 Bcast:10.7.29.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2817 errors:0 dropped:0 overruns:0 frame:0
TX packets:2233 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:354856 (346.5 KiB) TX bytes:328782 (321.0 KiB)
[root@tlclient ~]# cat /etc/fstab | grep lustre
10.7.129.130@o2ib0:/tlustre /testlustre lustre defaults,noauto,user_xattr,flock 0 0
tlmds/tloss (mdt and oss)
[root@tloss ~]# cat /etc/modprobe.d/lustre.conf
alias ib0 ib_ipoib
alias net-pf-27 ib_sdp
options lnet networks="o2ib0(ib0)" routes="tcp0 10.7.129.134@o2ib0" live_router_check_interval="60" dead_router_check_interval="60"
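One thing that can be checked mechanically in routes= lines like these is that the gateway NID sits on a network the node itself is attached to. The sketch below is a plain string check with a hypothetical function name. It assumes the simple "<remote net> <gateway NID>" route form used here, and it does not treat "tcp" and "tcp0" as the same network.

```shell
#!/bin/sh
# Hypothetical check: is the route's gateway on one of the local networks?
#   $1 = value of networks=, e.g. "o2ib0(ib0),tcp0(em1)"
#   $2 = value of routes=,   e.g. "o2ib0 10.7.29.134@tcp0"
# Returns 0 if the gateway's network appears in networks=, 1 otherwise.
gateway_is_local() {
    networks=$1
    route=$2
    gw=${route#* }      # drop the remote net, keep the gateway NID
    gw_net=${gw##*@}    # keep only the network part of the NID
    case ",$networks" in
        *",$gw_net("*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, gateway_is_local 'tcp0(em1)' 'o2ib0 10.7.29.134@tcp0' succeeds for the client settings, and gateway_is_local 'o2ib0(ib0)' 'tcp0 10.7.129.134@o2ib0' succeeds for the server settings.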
tloss ifconfig
[root@tloss ~]# ifconfig #lo omitted
em1 Link encap:Ethernet HWaddr 78:2B:CB:4A:7A:F8
inet addr:10.7.29.131 Bcast:10.7.29.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7939328 errors:0 dropped:0 overruns:0 frame:0
TX packets:4920595 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7016088640 (6.5 GiB) TX bytes:447490407 (426.7 MiB)
ib0 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.7.129.131 Bcast:10.7.129.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:484688 errors:0 dropped:0 overruns:0 frame:0
TX packets:62465 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:845062706 (805.9 MiB) TX bytes:919378780 (876.7 MiB)
tlmds ifconfig
[root@tlmds ~]# ifconfig #lo omitted
em1 Link encap:Ethernet HWaddr 78:2B:CB:28:1D:00
inet addr:10.7.29.130 Bcast:10.7.29.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7849519 errors:0 dropped:0 overruns:0 frame:0
TX packets:4847566 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7049031324 (6.5 GiB) TX bytes:484594569 (462.1 MiB)
ib0 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.7.129.130 Bcast:10.7.129.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:532171 errors:0 dropped:0 overruns:0 frame:0
TX packets:64114 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:946230130 (902.3 MiB) TX bytes:821297144 (783.2 MiB)
--
Jessica Otey
System Administrator II
North American ALMA Science Center (NAASC)
National Radio Astronomy Observatory (NRAO)
Charlottesville, Virginia (USA)