[lustre-discuss] lustre-discuss Digest, Vol 147, Issue 43
Ms. Megan Larko
dobsonunit at gmail.com
Tue Jul 3 09:58:50 PDT 2018
WRT Subject: lctl ping node28 at o2ib report Input/output error
Hello Yu,
Just to check the obvious,
-- the recipient system (node28) is running lnet (an "lsmod | grep lnet"
returns the appropriate modules, for example)
-- there is nothing along the path which might be blocking Lustre port 988
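
A minimal sketch of the first check, assuming a Linux node where /proc/modules is readable. The helper name lnet_loaded is made up for this example and is not a Lustre tool. It only looks for the module and, if lctl is installed, lists the local NIDs:

```shell
#!/bin/sh
# Succeeds when the lnet module appears in a /proc/modules-style listing.
# The optional argument (a file path) exists only to make the check testable.
lnet_loaded() {
    grep -q '^lnet ' "${1:-/proc/modules}" 2>/dev/null
}

if lnet_loaded; then
    echo "lnet module is loaded"
    # lctl list_nids prints the NIDs configured on this node.
    if command -v lctl >/dev/null 2>&1; then
        lctl list_nids
    fi
else
    echo "lnet module is not loaded; try 'modprobe lnet' and 'lctl network up'"
fi
```

Run it on node28 first, then on the client, and compare the NIDs each side reports.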
Cheers,
megan
On Fri, Jun 29, 2018 at 4:19 PM, <lustre-discuss-request at lists.lustre.org>
wrote:
> Send lustre-discuss mailing list submissions to
> lustre-discuss at lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-request at lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-owner at lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: lctl ping node28 at o2ib report Input/output error (Cory Spitz)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 29 Jun 2018 16:14:18 +0000
> From: Cory Spitz <spitzcor at cray.com>
> To: Andreas Dilger <adilger at whamcloud.com>, yu sun
> <sunyu1949 at gmail.com>
> Cc: "lustre-discuss at lists.lustre.org"
> <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] lctl ping node28 at o2ib report
> Input/output error
> Message-ID: <AC964404-78C4-4F0F-B894-7619464AFF90 at cray.com>
> Content-Type: text/plain; charset="utf-8"
>
> FYI, there is a helpful guide to LNet setup at
> http://wiki.lustre.org/LNet_Router_Config_Guide. Despite the title, it
> is applicable to non-routed cases as well.
> -Cory
>
> --
>
> On 6/29/18, 1:06 AM, "lustre-discuss on behalf of Andreas Dilger" <
> lustre-discuss-bounces at lists.lustre.org on behalf of adilger at whamcloud.com>
> wrote:
>
> On Jun 28, 2018, at 21:14, yu sun <sunyu1949 at gmail.com> wrote:
> >
> > All of the servers and clients mentioned above use the netmask
> 255.255.255.224, and they can ping each other, for example:
> >
> > root at ml-gpu-ser200.nmg01:~$ ping node28
> > PING node28 (10.82.143.202) 56(84) bytes of data.
> > 64 bytes from node28 (10.82.143.202): icmp_seq=1 ttl=61 time=0.047 ms
> > 64 bytes from node28 (10.82.143.202): icmp_seq=2 ttl=61 time=0.028 ms
> >
> > --- node28 ping statistics ---
> > 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> > rtt min/avg/max/mdev = 0.028/0.037/0.047/0.011 ms
> > root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
> > failed to ping 10.82.143.202 at o2ib1: Input/output error
> > root at ml-gpu-ser200.nmg01:~$
> >
> > We also have hundreds of GPU machines on different IP
> subnets. They are in service, and it is difficult to change the network
> structure. Is there any material or documentation that could help me solve
> this without changing the network structure?
>
> The regular IP "ping" is being routed by an IP router, but that doesn't
> work with IB networks, AFAIK. Either the IB interfaces need to be on the
> same subnet, or you need to have an IB interface configured on each
> subnet (which might get ugly if you have a large number of subnets),
> or you need to use LNet routers that are connected to each IB subnet to
> do the routing (each subnet would be a separate LNet network, for example
> 10.82.142.202 at o2ib23 or whatever).
>
> The other option would be to use the IPoIB layer with socklnd (e.g.
> 10.82.142.202 at tcp) but this would not run as fast as native verbs.
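
The LNet router and socklnd options described above could be sketched as module options along these lines. This is an untested illustration, not a verified configuration for this cluster: the client network name o2ib2, the interface names ib0/ib1, and the router addresses 10.82.143.193 and 10.82.141.1 are all assumptions. NIDs are written with "@" here, since the archive rewrites "@" as " at ". The script only prints candidate lines for /etc/modprobe.d/lustre.conf and changes nothing:

```shell
#!/bin/sh
# Print a candidate "options lnet" line for each role.
# Nothing is written to disk; review the output before using any of it.
# Hypothetical values: o2ib2, ib0/ib1, 10.82.143.193, 10.82.141.1.
lnet_conf() {
    case "$1" in
        router)
            # One IB interface per subnet, forwarding between the two networks.
            echo 'options lnet networks="o2ib1(ib0),o2ib2(ib1)" forwarding="enabled"' ;;
        server)
            # Reach the client network through the router's NID on o2ib1.
            echo 'options lnet networks="o2ib1(ib0)" routes="o2ib2 10.82.143.193@o2ib1"' ;;
        client)
            # Reach the server network through the router's NID on o2ib2.
            echo 'options lnet networks="o2ib2(ib0)" routes="o2ib1 10.82.141.1@o2ib2"' ;;
        socklnd)
            # Alternative without routers: socklnd over the IPoIB interface.
            echo 'options lnet networks="tcp0(ib0)"' ;;
        *)
            echo "usage: lnet_conf router|server|client|socklnd" >&2
            return 1 ;;
    esac
}

for role in router server client socklnd; do
    printf '%s: ' "$role"
    lnet_conf "$role"
done
```

In the routed variant the clients would still mount using the servers' o2ib1 NIDs. In the socklnd variant both sides would use tcp NIDs instead, at the cost of native verbs performance.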
>
> Cheers, Andreas
>
>
> > Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote on Jun 29, 2018 at 3:30:
> >
> > > On Jun 27, 2018, at 4:44 PM, Mohr Jr, Richard Frank (Rick Mohr) <
> rmohr at utk.edu> wrote:
> > >
> > >
> > >> On Jun 27, 2018, at 3:12 AM, yu sun <sunyu1949 at gmail.com> wrote:
> > >>
> > >> client:
> > >> root at ml-gpu-ser200.nmg01:~$ mount -t lustre node28 at o2ib1:node29 at o2ib1:/project /mnt/lustre_data
> > >> mount.lustre: mount node28 at o2ib1:node29 at o2ib1:/project at /mnt/lustre_data failed: Input/output error
> > >> Is the MGS running?
> > >> root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
> > >> failed to ping 10.82.143.202 at o2ib1: Input/output error
> > >> root at ml-gpu-ser200.nmg01:~$
> > >
> > > In your previous email, you said that you could mount lustre on
> the client ml-gpu-ser200.nmg01. Was that not accurate, or did something
> change in the meantime?
> >
> > (Note: Received out-of-band reply from Yu stating that there was a
> typo in the previous email, and that client ml-gpu-ser200.nmg01 could not
> mount lustre. Continuing discussion here so others on list can
> follow/benefit.)
> >
> > Yu,
> >
> > For the IPoIB addresses used on your nodes, what are the subnets
> (and netmasks) that you are using? It looks like servers use 10.82.143.X
> and clients use 10.82.141.X. If you are using a 255.255.0.0 netmask, you
> should be fine. But if you are using 255.255.255.0, then you will run into
> problems. Lustre expects that all nodes on the same lnet network (o2ib1 in
> your case) will also be on the same IP subnet.
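
To make the subnet point concrete, here is a small POSIX shell sketch that tests whether two addresses share a subnet under a given netmask. The server address is node28's from the ping output in this thread; the client address 10.82.141.10 is a made-up example in the 10.82.141.X range, and the helper names are hypothetical:

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeeds when both addresses have the same network part under the netmask.
same_subnet() {
    n1=$(ip_to_int "$1")
    n2=$(ip_to_int "$2")
    nm=$(ip_to_int "$3")
    [ $(( n1 & nm )) -eq $(( n2 & nm )) ]
}

server=10.82.143.202   # node28, from the thread
client=10.82.141.10    # hypothetical client address

for netmask in 255.255.0.0 255.255.255.0 255.255.255.224; do
    if same_subnet "$server" "$client" "$netmask"; then
        echo "$netmask: same subnet"
    else
        echo "$netmask: different subnets"
    fi
done
```

With these inputs only the 255.255.0.0 case reports the same subnet. The 255.255.255.224 netmask reported elsewhere in this digest puts the two addresses on different subnets.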
> >
> > Have you tried running a regular "ping <IPoIB_address>" command
> between clients and servers to make sure that part is working?
> >
> > --
> > Rick Mohr
> > Senior HPC System Administrator
> > National Institute for Computational Sciences
> > http://www.nics.tennessee.edu
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>