[lustre-discuss] difficulties mounting client via an lnet router

Jessica Otey jotey at nrao.edu
Mon Jul 11 07:34:36 PDT 2016


All,
I am, as before, working on a small test Lustre setup (RHEL 6.8, Lustre
v. 2.4.3) to prepare for upgrading a 1.8.9 Lustre production system to
2.4.3 (first the servers and LNet routers, then, at a later time,
the clients). The Lustre servers have IB connections, but the clients
are 1G Ethernet only.

For the life of me, I cannot get the client to mount via the router on 
this test system. (Client will mount fine when router is taken out of 
the equation.) This is the error I am seeing in the syslog from the 
mount attempt:

Jul 11 10:15:37 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246532/real 1468246532]  req@ffff88032a3f9400 x1539566484848752/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:07 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246557/real 1468246557]  req@ffff880629819000 x1539566484848764/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246567 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:37 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246582/real 1468246582]  req@ffff88062a371000 x1539566484848772/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246597 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 11 10:16:44 tlclient kernel: LustreError: 2511:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Jul 11 10:16:44 tlclient kernel: Lustre: Unmounted tlustre-client
Jul 11 10:16:44 tlclient kernel: LustreError: 4881:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-4)
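
For reference, here is a minimal Python sketch that reads the interesting
fields out of one of the timeout entries above: the RPC opcode, the import
name, the peer NID the request was addressed to, and the deadline. The
function name and the regular expressions are illustrative assumptions, not
part of any Lustre tool, and the field meanings reflect my reading of the
debug format. Opcode 38 is MDS_CONNECT in Lustre's opcode list.

```python
import re

# One entry from the log above, rejoined onto a single line, with the list
# archive's " at " obfuscation restored to "@".
LINE = (
    "Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request "
    "sent has timed out for slow reply: [sent 1468246532/real 1468246532]  "
    "req@ffff88032a3f9400 x1539566484848752/t0(0) "
    "o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens "
    "400/544 e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1"
)

# Assumed layout: o<opcode>-><import name>@<peer NID>:<req portal>/<rep portal>
# The "(?:@| at )" alternative also accepts the archive's obfuscated form.
REQ = re.compile(
    r"o(?P<opcode>\d+)->(?P<target>\S+?)(?:@| at )"
    r"(?P<addr>[0-9.]+)@(?P<net>[a-z0-9]+):"
)
TIMES = re.compile(r"\[sent (?P<sent>\d+)/real \d+\].* dl (?P<dl>\d+) ")


def parse_expired_request(line):
    """Return opcode, import name, peer NID and deadline of a timeout entry."""
    req = REQ.search(line)
    times = TIMES.search(line)
    if req is None or times is None:
        raise ValueError("not a ptlrpc_expire_one_request line")
    return {
        "opcode": int(req["opcode"]),
        "target": req["target"],
        "nid": f"{req['addr']}@{req['net']}",
        "net": req["net"],
        # Seconds between the send time and the deadline recorded in the entry.
        "deadline_s": int(times["dl"]) - int(times["sent"]),
    }


info = parse_expired_request(LINE)
```

For the first entry this gives opcode 38, peer NID 10.7.29.130@tcp, and a
5-second deadline; the same arithmetic on the next two entries gives 10 and
15 seconds. The peer NID is the field to compare against the NIDs in the
configs below.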

More than one pair of eyes has looked at the configs and confirmed they 
look okay. But frankly we've got to be missing something since this 
should (like lustre on a good day) 'just work'.

If anyone has seen this issue before and could give some advice, it'd be 
appreciated. One major question I have is whether the problem is a 
configuration issue or a procedure issue--perhaps the order in which I 
am doing things is causing the failure? The order I'm following 
currently is:

1) unmount/remove modules on all boxes
2) bring up the lnet modules on the router, and bring up the network
3) On the mds: add the modules, bring up the network, mount the mdt
4) On the oss: add the modules, bring up the network, mount the ost
5) On the client: add the modules, bring up the network, attempt to 
mount client (fails)
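
The order above can be written down as a dry-run plan, which makes it easier
to compare against what was actually typed on each box. This is only a
sketch: nothing is executed, the commands are assumptions about how each
step would typically be done on a Lustre 2.x node, and the device and server
mount-point names are placeholders rather than values from this setup. The
two `lctl ping` lines are an added check, not part of the original steps.

```python
# Dry-run only: this prints commands, it does not run them.
# <mdt-device>, <ost-device> and the server mount points are placeholders.
PLAN = [
    ("all nodes", ["umount -a -t lustre", "lustre_rmmod"]),
    ("tlnet", ["modprobe lnet", "lctl network up", "lctl list_nids"]),
    ("tlmds", ["modprobe lustre", "lctl network up",
               "mount -t lustre <mdt-device> <mdt-mountpoint>"]),
    ("tloss", ["modprobe lustre", "lctl network up",
               "mount -t lustre <ost-device> <ost-mountpoint>"]),
    ("tlclient", ["modprobe lustre", "lctl network up",
                  "lctl ping 10.7.29.134@tcp0",    # router, same network
                  "lctl ping 10.7.129.130@o2ib0",  # MDS, through the router
                  "mount /testlustre"]),           # uses the fstab entry
]


def render(plan):
    """Return one 'host# command' line per step, in execution order."""
    return [f"{host}# {cmd}" for host, cmds in plan for cmd in cmds]


print("\n".join(render(PLAN)))
```

If the second `lctl ping` fails while the first succeeds, that would point at
routing rather than at the mount itself.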

Configs follow below.

Thanks in advance,
Jessica

tlnet (the router)
[root@tlnet ~]# cat /etc/modprobe.d/lustre.conf
# tlnet configuration
alias ib0 ib_ipoib
alias net-pf-27 ib_sdp
options lnet networks="o2ib0(ib0),tcp0(em1)" forwarding="enabled"

[root@tlnet ~]# ifconfig #lo omitted
em1       Link encap:Ethernet  HWaddr 78:2B:CB:25:A7:E2
           inet addr:10.7.29.134  Bcast:10.7.29.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:453441 errors:0 dropped:0 overruns:0 frame:0
           TX packets:264313 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:436188202 (415.9 MiB)  TX bytes:22274957 (21.2 MiB)
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
           inet addr:10.7.129.134  Bcast:10.7.129.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
           RX packets:650 errors:0 dropped:0 overruns:0 frame:0
           TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:256
           RX bytes:75376 (73.6 KiB)  TX bytes:2904 (2.8 KiB)

tlclient (the client)
[root@tlclient ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" live_router_check_interval=60 dead_router_check_interval=60
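
As a reading aid, here is a small Python sketch that splits an
`options lnet` line like the one above into its parameters and breaks the
`routes` value into (remote network, gateway NID) pairs. The helper names are
made up for illustration, and the route parser only handles the simple
"<net> [hops] <gateway>" form with entries separated by ";", not the range
syntax LNet also accepts.

```python
import shlex

CLIENT_LINE = (
    'options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" '
    "live_router_check_interval=60 dead_router_check_interval=60"
)


def parse_lnet_options(line):
    """Turn an 'options lnet key=value ...' line into a dict of strings."""
    tokens = shlex.split(line)  # strips the double quotes around values
    if tokens[:2] != ["options", "lnet"]:
        raise ValueError("not an 'options lnet' line")
    return dict(token.split("=", 1) for token in tokens[2:])


def parse_routes(routes):
    """Return [(remote_net, gateway_nid), ...] for simple route entries."""
    pairs = []
    for entry in routes.split(";"):
        parts = entry.split()
        if parts:
            # First field is the remote network, last field is the gateway;
            # an optional hop count in between is ignored.
            pairs.append((parts[0], parts[-1]))
    return pairs


opts = parse_lnet_options(CLIENT_LINE)
routes = parse_routes(opts["routes"])
```

Here `routes` holds a single pair: network o2ib0 reached through gateway
10.7.29.134@tcp0.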

[root@tlclient ~]# ifconfig #lo omitted
em1       Link encap:Ethernet  HWaddr 00:26:B9:35:B1:1A
           inet addr:10.7.29.132  Bcast:10.7.29.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:2817 errors:0 dropped:0 overruns:0 frame:0
           TX packets:2233 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:354856 (346.5 KiB)  TX bytes:328782 (321.0 KiB)

[root@tlclient ~]# cat /etc/fstab | grep lustre
10.7.129.130@o2ib0:/tlustre    /testlustre    lustre defaults,noauto,user_xattr,flock  0 0
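
The device field in that fstab entry packs the MGS NID and the filesystem
name together. A short Python sketch, with hypothetical helper names, of how
it splits apart and why this mount depends on the router: the NID is on
o2ib0, while the client only has tcp0 locally.

```python
def parse_mount_spec(spec):
    """Split '<mgs nid>[:<failover nid>...]:/<fsname>' into (nids, fsname)."""
    nids, sep, fsname = spec.partition(":/")
    if not sep or not nids or not fsname:
        raise ValueError(f"unexpected mount spec: {spec!r}")
    return nids.split(":"), fsname


def is_routed(nid, local_nets):
    """True if the NID's network is not one of the node's local networks.

    Plain string comparison: it does not know that 'tcp' and 'tcp0' name
    the same network.
    """
    return nid.split("@", 1)[1] not in local_nets


nids, fsname = parse_mount_spec("10.7.129.130@o2ib0:/tlustre")
needs_router = is_routed(nids[0], {"tcp0"})
```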

tlmds/tloss (mdt and oss)
[root@tloss ~]# cat /etc/modprobe.d/lustre.conf
alias ib0 ib_ipoib
alias net-pf-27 ib_sdp
options lnet networks="o2ib0(ib0)" routes="tcp0 10.7.129.134@o2ib0" live_router_check_interval="60" dead_router_check_interval="60"

tloss ifconfig
[root@tloss ~]# ifconfig #lo omitted
em1       Link encap:Ethernet  HWaddr 78:2B:CB:4A:7A:F8
           inet addr:10.7.29.131  Bcast:10.7.29.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:7939328 errors:0 dropped:0 overruns:0 frame:0
           TX packets:4920595 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:7016088640 (6.5 GiB)  TX bytes:447490407 (426.7 MiB)
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
           inet addr:10.7.129.131  Bcast:10.7.129.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
           RX packets:484688 errors:0 dropped:0 overruns:0 frame:0
           TX packets:62465 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:256
           RX bytes:845062706 (805.9 MiB)  TX bytes:919378780 (876.7 MiB)

tlmds ifconfig
[root@tlmds ~]# ifconfig #lo omitted
em1       Link encap:Ethernet  HWaddr 78:2B:CB:28:1D:00
           inet addr:10.7.29.130  Bcast:10.7.29.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
           RX packets:7849519 errors:0 dropped:0 overruns:0 frame:0
           TX packets:4847566 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:7049031324 (6.5 GiB)  TX bytes:484594569 (462.1 MiB)

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
           inet addr:10.7.129.130  Bcast:10.7.129.255 Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
           RX packets:532171 errors:0 dropped:0 overruns:0 frame:0
           TX packets:64114 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:256
           RX bytes:946230130 (902.3 MiB)  TX bytes:821297144 (783.2 MiB)
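
To summarize the configs above, here is a rough Python cross-check of the
posted route settings against the router's addresses. It assumes tlmds uses
the same lustre.conf as tloss (only the tloss file is shown), and the data
structures and function name are invented for this sketch. It only compares
the text posted here; it does not query the running nodes.

```python
# NIDs the router should own, from the tlnet ifconfig output combined with
# its networks="o2ib0(ib0),tcp0(em1)" setting.
ROUTER_NIDS = {"10.7.29.134@tcp0", "10.7.129.134@o2ib0"}

# Local networks and routes as posted; tlmds is assumed to match tloss.
NODES = {
    "tlclient": {"local": {"tcp0"}, "routes": {"o2ib0": "10.7.29.134@tcp0"}},
    "tloss": {"local": {"o2ib0"}, "routes": {"tcp0": "10.7.129.134@o2ib0"}},
    "tlmds": {"local": {"o2ib0"}, "routes": {"tcp0": "10.7.129.134@o2ib0"}},
}


def check_routes(nodes, router_nids):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    for name, cfg in nodes.items():
        for remote_net, gateway in cfg["routes"].items():
            gateway_net = gateway.split("@", 1)[1]
            if remote_net in cfg["local"]:
                problems.append(f"{name}: {remote_net} is local, no route needed")
            if gateway_net not in cfg["local"]:
                problems.append(f"{name}: gateway {gateway} is not on a local network")
            if gateway not in router_nids:
                problems.append(f"{name}: {gateway} is not a NID of the router")
    return problems


problems = check_routes(NODES, ROUTER_NIDS)
```

With the values as posted, `problems` comes back empty, which agrees with the
configs having looked okay on review.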

-- 
Jessica Otey
System Administrator II
North American ALMA Science Center (NAASC)
National Radio Astronomy Observatory (NRAO)
Charlottesville, Virginia (USA)
