[lustre-discuss] lctl ping node28@o2ib report Input/output error

yu sun sunyu1949 at gmail.com
Sun Jul 1 20:36:40 PDT 2018


OK, thanks. I will try it today.

Yours,
Yu

On Sat, Jun 30, 2018 at 12:14 AM, Cory Spitz <spitzcor at cray.com> wrote:

> FYI, there is a helpful guide to LNet setup at
> http://wiki.lustre.org/LNet_Router_Config_Guide.  Despite the title, it
> is applicable to non-routed cases as well.
> -Cory
>
> --
>
> On 6/29/18, 1:06 AM, "lustre-discuss on behalf of Andreas Dilger" <
> lustre-discuss-bounces at lists.lustre.org on behalf of adilger at whamcloud.com>
> wrote:
>
>     On Jun 28, 2018, at 21:14, yu sun <sunyu1949 at gmail.com> wrote:
>     >
>     > All of the servers and clients mentioned above use the netmask
>     > 255.255.255.224, and they can ping each other. For example:
>     >
>     > root@ml-gpu-ser200.nmg01:~$ ping node28
>     > PING node28 (10.82.143.202) 56(84) bytes of data.
>     > 64 bytes from node28 (10.82.143.202): icmp_seq=1 ttl=61 time=0.047 ms
>     > 64 bytes from node28 (10.82.143.202): icmp_seq=2 ttl=61 time=0.028 ms
>     >
>     > --- node28 ping statistics ---
>     > 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>     > rtt min/avg/max/mdev = 0.028/0.037/0.047/0.011 ms
>     > root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
>     > failed to ping 10.82.143.202@o2ib1: Input/output error
>     > root@ml-gpu-ser200.nmg01:~$
>     >
>     > We also have hundreds of GPU machines on different IP subnets.
>     > They are in service, and it is difficult to change the network
>     > structure. Is there any material or documentation that can guide me
>     > in solving this without changing the network structure?
>
>     The regular IP "ping" is being routed by an IP router, but that doesn't
>     work with IB networks, AFAIK.  Either the IB interfaces need to be on
>     the same subnet, or you need to have an IB interface configured on each
>     subnet (which might get ugly if you have a large number of subnets), or
>     you need to use LNet routers that are connected to each IB subnet to do
>     the routing (each subnet would be a separate LNet network, for example
>     10.82.142.202@o2ib23 or whatever).
>
>     The other option would be to use the IPoIB layer with socklnd (e.g.
>     10.82.142.202@tcp) but this would not run as fast as native verbs.
>
>     Cheers, Andreas
>
>
>     > On Fri, Jun 29, 2018 at 3:30 AM, Mohr Jr, Richard Frank (Rick Mohr)
>     > <rmohr at utk.edu> wrote:
>     >
>     > > On Jun 27, 2018, at 4:44 PM, Mohr Jr, Richard Frank (Rick Mohr)
>     > > <rmohr at utk.edu> wrote:
>     > >
>     > >
>     > >> On Jun 27, 2018, at 3:12 AM, yu sun <sunyu1949 at gmail.com> wrote:
>     > >>
>     > >> client:
>     > >> root@ml-gpu-ser200.nmg01:~$ mount -t lustre node28@o2ib1:node29@o2ib1:/project /mnt/lustre_data
>     > >> mount.lustre: mount node28@o2ib1:node29@o2ib1:/project at /mnt/lustre_data failed: Input/output error
>     > >> Is the MGS running?
>     > >> root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
>     > >> failed to ping 10.82.143.202@o2ib1: Input/output error
>     > >> root@ml-gpu-ser200.nmg01:~$
>     > >
>     > > In your previous email, you said that you could mount lustre on
>     > > the client ml-gpu-ser200.nmg01.  Was that not accurate, or did
>     > > something change in the meantime?
>     >
>     > (Note: Received out-of-band reply from Yu stating that there was a
>     > typo in the previous email, and that client ml-gpu-ser200.nmg01 could
>     > not mount lustre.  Continuing discussion here so others on list can
>     > follow/benefit.)
>     >
>     > Yu,
>     >
>     > For the IPoIB addresses used on your nodes, what are the subnets
>     > (and netmasks) that you are using?  It looks like servers use
>     > 10.82.143.X and clients use 10.82.141.X.  If you are using a
>     > 255.255.0.0 netmask, you should be fine.  But if you are using
>     > 255.255.255.0, then you will run into problems.  Lustre expects that
>     > all nodes on the same lnet network (o2ib1 in your case) will also be
>     > on the same IP subnet.
>     >
>     > Have you tried running a regular "ping <IPoIB_address>" command
>     > between clients and servers to make sure that part is working?
>     >
>     > --
>     > Rick Mohr
>     > Senior HPC System Administrator
>     > National Institute for Computational Sciences
>     > http://www.nics.tennessee.edu
>     >
>     > _______________________________________________
>     > lustre-discuss mailing list
>     > lustre-discuss at lists.lustre.org
>     > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>     Cheers, Andreas
>     ---
>     Andreas Dilger
>     Principal Lustre Architect
>     Whamcloud
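
The netmask point raised in the thread (nodes on one LNet network such as
o2ib1 are expected to share an IP subnet) can be checked quickly before
touching any LNet configuration. Below is a minimal sketch using only the
Python standard library. The server address 10.82.143.202 comes from the ping
output above; the client address 10.82.141.10 is a hypothetical stand-in,
since the thread only gives 10.82.141.X, and the helper name same_subnet is
made up for illustration.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, netmask: str) -> bool:
    """Return True if both IPv4 addresses fall in the same subnet for netmask."""
    # ip_interface accepts "address/netmask" and exposes the enclosing network.
    net_a = ipaddress.ip_interface(f"{addr_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{netmask}").network
    return net_a == net_b


SERVER = "10.82.143.202"  # node28, taken from the ping output in the thread
CLIENT = "10.82.141.10"   # hypothetical client; the thread only says 10.82.141.X

if __name__ == "__main__":
    # Netmasks mentioned in the thread: two hypothetical ones and the real one.
    for mask in ("255.255.0.0", "255.255.255.0", "255.255.255.224"):
        verdict = "same subnet" if same_subnet(SERVER, CLIENT, mask) else "different subnets"
        print(f"{mask}: {verdict}")
```

Under these assumptions only the 255.255.0.0 mask puts the two hosts in one
subnet. With the 255.255.255.224 mask actually in use, they land in different
subnets, which fits the explanation that ICMP ping succeeds through an IP
router while lctl ping over o2ib1 fails.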