[lustre-discuss] lctl ping node28 at o2ib report Input/output error
yu sun
sunyu1949 at gmail.com
Sun Jul 1 20:36:40 PDT 2018
OK, thanks, I will try it today.

Yours,
Yu
Cory Spitz <spitzcor at cray.com> wrote on Sat, Jun 30, 2018 at 12:14 AM:
> FYI, there is a helpful guide to LNet setup at
> http://wiki.lustre.org/LNet_Router_Config_Guide. Despite the title, it
> is applicable to non-routed cases as well.
> -Cory
>
> --
>
> On 6/29/18, 1:06 AM, "lustre-discuss on behalf of Andreas Dilger" <
> lustre-discuss-bounces at lists.lustre.org on behalf of adilger at whamcloud.com>
> wrote:
>
> On Jun 28, 2018, at 21:14, yu sun <sunyu1949 at gmail.com> wrote:
> >
> > All the servers and clients mentioned above are using netmask
> > 255.255.255.224, and they can ping each other, for example:
> >
> > root at ml-gpu-ser200.nmg01:~$ ping node28
> > PING node28 (10.82.143.202) 56(84) bytes of data.
> > 64 bytes from node28 (10.82.143.202): icmp_seq=1 ttl=61 time=0.047 ms
> > 64 bytes from node28 (10.82.143.202): icmp_seq=2 ttl=61 time=0.028 ms
> >
> > --- node28 ping statistics ---
> > 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> > rtt min/avg/max/mdev = 0.028/0.037/0.047/0.011 ms
> > root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
> > failed to ping 10.82.143.202 at o2ib1: Input/output error
> > root at ml-gpu-ser200.nmg01:~$
> >
> > We also have hundreds of GPU machines on different IP
> > subnets. They are already in service and it is difficult to change the
> > network structure, so is there any material or documentation that can
> > guide me to solve this without changing the network structure?
>
> The regular IP "ping" is being routed by an IP router, but that doesn't
> work with IB networks, AFAIK. The IB interfaces need to be on the same
> subnet. Either you need to have an IB interface on each subnet
> configured on each node (which might get ugly if you have a large
> number of subnets), or you need to use LNet routers that are connected
> to each IB subnet to do the routing (each subnet would be a separate
> LNet network, for example 10.82.142.202 at o2ib23 or whatever).
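For illustration, the routed setup described above might be expressed in /etc/modprobe.d/lustre.conf roughly as follows. The interface names (ib0, ib1), the o2ib23 network number, and the router address 10.82.141.1 are hypothetical, not taken from this thread:

```
# On a client in the 10.82.141.x subnet (hypothetical example):
# its local fabric is LNet network o2ib1 on interface ib0, and it
# reaches the remote server fabric (o2ib23 here) via an LNet router.
options lnet networks="o2ib1(ib0)"
options lnet routes="o2ib23 10.82.141.1@o2ib1"

# On the LNet router node itself, which has an IB interface on each
# subnet, declare both LNet networks and enable forwarding between them:
options lnet networks="o2ib1(ib0),o2ib23(ib1)"
options lnet forwarding="enabled"
```

After reloading the lnet module, `lctl ping <NID>` across the two networks would go through the router rather than relying on IP routing.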
>
> The other option would be to use the IPoIB layer with socklnd (e.g.
> 10.82.142.202 at tcp) but this would not run as fast as native verbs.
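A minimal sketch of the IPoIB fallback mentioned above, again with a hypothetical interface name: run the socket LND (ksocklnd) over the IPoIB interface instead of the native verbs LND.

```
# /etc/modprobe.d/lustre.conf on each node: use ksocklnd over the
# IPoIB interface ib0 instead of ko2iblnd (hypothetical example).
options lnet networks="tcp0(ib0)"
```

The client would then mount with @tcp0 NIDs (e.g. node28@tcp0:node29@tcp0:/project), and ordinary IP routing between the subnets would apply, at the cost of IPoIB performance.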
>
> Cheers, Andreas
>
>
> > Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote on Fri,
> > Jun 29, 2018 at 3:30 AM:
> >
> > > On Jun 27, 2018, at 4:44 PM, Mohr Jr, Richard Frank (Rick Mohr) <
> rmohr at utk.edu> wrote:
> > >
> > >
> > >> On Jun 27, 2018, at 3:12 AM, yu sun <sunyu1949 at gmail.com> wrote:
> > >>
> > >> client:
> > >> root at ml-gpu-ser200.nmg01:~$ mount -t lustre node28 at o2ib1
> :node29 at o2ib1:/project /mnt/lustre_data
> > >> mount.lustre: mount node28 at o2ib1:node29 at o2ib1:/project at
> /mnt/lustre_data failed: Input/output error
> > >> Is the MGS running?
> > >> root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
> > >> failed to ping 10.82.143.202 at o2ib1: Input/output error
> > >> root at ml-gpu-ser200.nmg01:~$
> > >
> > > In your previous email, you said that you could mount lustre on
> the client ml-gpu-ser200.nmg01. Was that not accurate, or did something
> change in the meantime?
> >
> > (Note: Received out-of-band reply from Yu stating that there was a
> typo in the previous email, and that client ml-gpu-ser200.nmg01 could not
> mount lustre. Continuing discussion here so others on list can
> follow/benefit.)
> >
> > Yu,
> >
> > For the IPoIB addresses used on your nodes, what are the subnets
> (and netmasks) that you are using? It looks like servers use 10.82.143.X
> and clients use 10.82.141.X. If you are using a 255.255.0.0 netmask, you
> should be fine. But if you are using 255.255.255.0, then you will run into
> problems. Lustre expects that all nodes on the same lnet network (o2ib1 in
> your case) will also be on the same IP subnet.
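The subnet check described above can be verified directly. A small Python sketch using the standard ipaddress module; the server address 10.82.143.202 is from this thread, while the client address 10.82.141.10 is a hypothetical host in the client range mentioned:

```python
import ipaddress

def same_lnet_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """Check whether two interface addresses fall in the same IP subnet.

    Lustre expects all nodes on one LNet network (e.g. o2ib1) to also
    share an IP subnet, so this mirrors the check described above.
    """
    net_a = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{netmask}", strict=False)
    return net_a == net_b

# With a 255.255.0.0 netmask, server and client share 10.82.0.0/16:
print(same_lnet_subnet("10.82.143.202", "10.82.141.10", "255.255.0.0"))      # True
# With 255.255.255.224 (/27), they land in different subnets:
print(same_lnet_subnet("10.82.143.202", "10.82.141.10", "255.255.255.224"))  # False
```

With the 255.255.255.224 netmask reported earlier in the thread, the server and client addresses are on different subnets, which is consistent with the `lctl ping` failure.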
> >
> > Have you tried running a regular “ping <IPoIB_address>” command
> between clients and servers to make sure that part is working?
> >
> > --
> > Rick Mohr
> > Senior HPC System Administrator
> > National Institute for Computational Sciences
> > http://www.nics.tennessee.edu
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud