[lustre-discuss] difficulties mounting client via an lnet router

Oucharek, Doug S doug.s.oucharek at intel.com
Mon Jul 11 08:28:33 PDT 2016


You mentioned that the servers are on the o2ib0 network, but the error messages show the client trying to reach the MDT at 10.7.29.130@tcp, which is the Ethernet (em1) address of tlmds rather than its o2ib0 NID. The file system configuration needs to be updated to use the updated NIDs.
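
To make that concrete, the NID a client is actually trying can be read straight out of one of the ptlrpc_expire_one_request lines. What follows is only a minimal Python sketch: the function names and the regular expression are illustrative and not part of any Lustre tool, and the pattern is an assumption based on the log lines in this thread. It accepts either "@" or the " at " that some list archives substitute for it.

```python
import re

# One of the timeout lines from the client's syslog (shortened at the front).
SAMPLE = ("req@ffff88032a3f9400 x1539566484848752/t0(0) "
          "o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 "
          "lens 400/544 e 0 to 1 dl 1468246537 ref 1")

# Assumed shape of the request target: o<opcode>-><target>@<address>@<network>:
TARGET_RE = re.compile(
    r"o\d+->(?P<target>[\w-]+)(?:@| at )(?P<addr>[\d.]+)@(?P<net>[a-z0-9]+):"
)

def target_nid(log_line):
    """Return (target, address, network) for the request, or None if absent."""
    m = TARGET_RE.search(log_line)
    if m is None:
        return None
    return m.group("target"), m.group("addr"), m.group("net")

def normalize_net(net):
    """A bare network type such as 'tcp' means network 0, i.e. 'tcp0'."""
    return net if net[-1].isdigit() else net + "0"

def on_expected_net(log_line, expected_net):
    """True if the request targets expected_net, False if not, None if unparsed."""
    parsed = target_nid(log_line)
    if parsed is None:
        return None
    return normalize_net(parsed[2]) == normalize_net(expected_net)

if __name__ == "__main__":
    print(target_nid(SAMPLE))
    print(on_expected_net(SAMPLE, "o2ib0"))
```

For the sample line this reports a target on the tcp network, whereas the client's fstab entry points at an o2ib0 NID. That mismatch is what suggests the configuration still carries the old NIDs.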

Doug

> On Jul 11, 2016, at 7:34 AM, Jessica Otey <jotey at nrao.edu> wrote:
> 
> All,
> I am, as before, working on a small test Lustre setup (RHEL 6.8, Lustre v. 2.4.3) to prepare for upgrading a 1.8.9 Lustre production system to 2.4.3 (first the servers and LNet routers, then, at a subsequent time, the clients). The Lustre servers have IB connections, but the clients are 1G Ethernet only.
> 
> For the life of me, I cannot get the client to mount via the router on this test system. (Client will mount fine when router is taken out of the equation.) This is the error I am seeing in the syslog from the mount attempt:
> 
> Jul 11 10:15:37 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246532/real 1468246532]  req@ffff88032a3f9400 x1539566484848752/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:07 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246557/real 1468246557]  req@ffff880629819000 x1539566484848764/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246567 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:37 tlclient kernel: Lustre: 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468246582/real 1468246582]  req@ffff88062a371000 x1539566484848772/t0(0) o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 e 0 to 1 dl 1468246597 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:44 tlclient kernel: LustreError: 2511:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
> Jul 11 10:16:44 tlclient kernel: Lustre: Unmounted tlustre-client
> Jul 11 10:16:44 tlclient kernel: LustreError: 4881:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-4)
> 
> More than one pair of eyes has looked at the configs and confirmed they look okay. But frankly we've got to be missing something since this should (like lustre on a good day) 'just work'.
> 
> If anyone has seen this issue before and could give some advice, it'd be appreciated. One major question I have is whether the problem is a configuration issue or a procedure issue--perhaps the order in which I am doing things is causing the failure? The order I'm following currently is:
> 
> 1) unmount/remove modules on all boxes
> 2) bring up the lnet modules on the router, and bring up the network
> 3) On the mds: add the modules, bring up the network, mount the mdt
> 4) On the oss: add the modules, bring up the network, mount the oss
> 5) On the client: add the modules, bring up the network, attempt to mount client (fails)
> 
> Configs follow below.
> 
> Thanks in advance,
> Jessica
> 
> tlnet (the router)
> [root@tlnet ~]# cat /etc/modprobe.d/lustre.conf
> # tlnet configuration
> alias ib0 ib_ipoib
> alias net-pf-27 ib_sdp
> options lnet networks="o2ib0(ib0),tcp0(em1)" forwarding="enabled"
> 
> [root@tlnet ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:25:A7:E2
>          inet addr:10.7.29.134  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:453441 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:264313 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:436188202 (415.9 MiB)  TX bytes:22274957 (21.2 MiB)
> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.134  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:650 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:75376 (73.6 KiB)  TX bytes:2904 (2.8 KiB)
> 
> tlclient (the client)
> [root@tlclient ~]# cat /etc/modprobe.d/lustre.conf
> options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" live_router_check_interval=60 dead_router_check_interval=60
> 
> [root@tlclient ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 00:26:B9:35:B1:1A
>          inet addr:10.7.29.132  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:2817 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:2233 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:354856 (346.5 KiB)  TX bytes:328782 (321.0 KiB)
> 
> [root@tlclient ~]# cat /etc/fstab | grep lustre
> 10.7.129.130@o2ib0:/tlustre    /testlustre    lustre defaults,noauto,user_xattr,flock  0 0
> 
> tlmds/tloss (mdt and oss)
> [root@tloss ~]# cat /etc/modprobe.d/lustre.conf
> alias ib0 ib_ipoib
> alias net-pf-27 ib_sdp
> options lnet networks="o2ib0(ib0)" routes="tcp0 10.7.129.134@o2ib0" live_router_check_interval="60" dead_router_check_interval="60"
> 
> tloss ifconfig
> [root@tloss ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:4A:7A:F8
>          inet addr:10.7.29.131  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:7939328 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:4920595 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:7016088640 (6.5 GiB)  TX bytes:447490407 (426.7 MiB)
> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.131  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:484688 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:62465 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:845062706 (805.9 MiB)  TX bytes:919378780 (876.7 MiB)
> 
> tlmds ifconfig
> [root@tlmds ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:28:1D:00
>          inet addr:10.7.29.130  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:7849519 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:4847566 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:7049031324 (6.5 GiB)  TX bytes:484594569 (462.1 MiB)
> 
> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.130  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:532171 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:64114 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:946230130 (902.3 MiB)  TX bytes:821297144 (783.2 MiB)
> 
> -- 
> Jessica Otey
> System Administrator II
> North American ALMA Science Center (NAASC)
> National Radio Astronomy Observatory (NRAO)
> Charlottesville, Virginia (USA)
> 
