[lustre-discuss] lustre-discuss Digest, Vol 147, Issue 43

Ms. Megan Larko dobsonunit at gmail.com
Tue Jul 3 09:58:50 PDT 2018


WRT Subject: lctl ping node28 at o2ib report   Input/output error

Hello Yu,

Just to check the obvious,
--  the recipient system (node28) is running lnet (an "lsmod | grep lnet"
returns the appropriate modules, for example)
--  there is nothing along the path which might be blocking Lustre port 988 (a quick check sketch follows below)
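
A minimal sketch of those two checks (assuming iptables for the firewall;
adapt to whatever tooling is actually on your nodes):

    lsmod | grep -e lnet -e ko2iblnd   # LNet core plus the o2ib LND should both be loaded
    lctl list_nids                     # shows the NIDs LNet has actually configured locally
    iptables -L -n | grep 988          # look for rules that could drop Lustre's default port 988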

Cheers,
megan

On Fri, Jun 29, 2018 at 4:19 PM, <lustre-discuss-request at lists.lustre.org>
wrote:

>
> Today's Topics:
>
>    1. Re: lctl ping node28 at o2ib report Input/output error (Cory Spitz)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 29 Jun 2018 16:14:18 +0000
> From: Cory Spitz <spitzcor at cray.com>
> To: Andreas Dilger <adilger at whamcloud.com>, yu sun
>         <sunyu1949 at gmail.com>
> Cc: "lustre-discuss at lists.lustre.org"
>         <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] lctl ping node28 at o2ib report
>         Input/output error
> Message-ID: <AC964404-78C4-4F0F-B894-7619464AFF90 at cray.com>
> Content-Type: text/plain; charset="utf-8"
>
> FYI, there is a helpful guide to LNet setup at
> http://wiki.lustre.org/LNet_Router_Config_Guide.  Despite the title, it
> is applicable to non-routed cases as well.
> -Cory
>
> --
>
> On 6/29/18, 1:06 AM, "lustre-discuss on behalf of Andreas Dilger" <
> lustre-discuss-bounces at lists.lustre.org on behalf of adilger at whamcloud.com>
> wrote:
>
>     On Jun 28, 2018, at 21:14, yu sun <sunyu1949 at gmail.com> wrote:
>     >
>     > All of the servers and clients mentioned above use netmask
>     > 255.255.255.224, and they can ping each other, for example:
>     >
>     > root at ml-gpu-ser200.nmg01:~$ ping node28
>     > PING node28 (10.82.143.202) 56(84) bytes of data.
>     > 64 bytes from node28 (10.82.143.202): icmp_seq=1 ttl=61 time=0.047 ms
>     > 64 bytes from node28 (10.82.143.202): icmp_seq=2 ttl=61 time=0.028 ms
>     >
>     > --- node28 ping statistics ---
>     > 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>     > rtt min/avg/max/mdev = 0.028/0.037/0.047/0.011 ms
>     > root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
>     > failed to ping 10.82.143.202 at o2ib1: Input/output error
>     > root at ml-gpu-ser200.nmg01:~$
>     >
>     > We also have hundreds of GPU machines on different IP subnets. They are
>     > in service, and it is difficult to change the network structure. Is there
>     > any material or documentation that can guide me to solve this without
>     > changing the network structure?
>
>     The regular IP "ping" is being routed by an IP router, but that doesn't
>     work with IB networks, AFAIK.  The IB interfaces need to be on the same
>     subnet.  Either you configure an IB interface on each subnet on every
>     node (which might get ugly if you have a large number of subnets), or
>     you use LNet routers that are connected to each IB subnet to do the
>     routing (each subnet would be a separate LNet network, for example
>     10.82.142.202 at o2ib23 or whatever).
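>
>     As a rough lnetctl sketch of that routed setup (the interface names and
>     the gateway NID below are placeholders, not values from this thread):
>
>         # on the LNet router, with one IB interface on each IB subnet:
>         lnetctl net add --net o2ib1 --if ib0
>         lnetctl net add --net o2ib23 --if ib1
>         lnetctl set routing 1
>
>         # on a client living on o2ib23, pointing at the router's o2ib23 NID:
>         lnetctl route add --net o2ib1 --gateway 10.82.141.1 at o2ib23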
>
>     The other option would be to use the IPoIB layer with socklnd (e.g.
>     10.82.142.202 at tcp) but this would not run as fast as native verbs.
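>
>     A minimal sketch of that socklnd-over-IPoIB option (the interface name
>     ib0 is a placeholder):
>
>         # /etc/modprobe.d/lustre.conf on clients and servers
>         options lnet networks="tcp0(ib0)"
>
>         # or, equivalently, with lnetctl:
>         lnetctl net add --net tcp0 --if ib0
>
>     The filesystem would then be mounted using tcp0 NIDs, for example
>     mount -t lustre node28 at tcp0:node29 at tcp0:/project /mnt/lustre_data.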
>
>     Cheers, Andreas
>
>
>     > Mohr Jr, Richard Frank (Rick Mohr) <rmohr at utk.edu> wrote on Fri, Jun 29, 2018 at 3:30 PM:
>     >
>     > > On Jun 27, 2018, at 4:44 PM, Mohr Jr, Richard Frank (Rick Mohr) <
> rmohr at utk.edu> wrote:
>     > >
>     > >
>     > >> On Jun 27, 2018, at 3:12 AM, yu sun <sunyu1949 at gmail.com> wrote:
>     > >>
>     > >> client:
>     > >> root at ml-gpu-ser200.nmg01:~$ mount -t lustre node28 at o2ib1:node29 at o2ib1:/project /mnt/lustre_data
>     > >> mount.lustre: mount node28 at o2ib1:node29 at o2ib1:/project at /mnt/lustre_data failed: Input/output error
>     > >> Is the MGS running?
>     > >> root at ml-gpu-ser200.nmg01:~$ lctl ping node28 at o2ib1
>     > >> failed to ping 10.82.143.202 at o2ib1: Input/output error
>     > >> root at ml-gpu-ser200.nmg01:~$
>     > >
>     > > In your previous email, you said that you could mount lustre on
> the client ml-gpu-ser200.nmg01.  Was that not accurate, or did something
> change in the meantime?
>     >
>     > (Note: Received out-of-band reply from Yu stating that there was a
> typo in the previous email, and that client ml-gpu-ser200.nmg01 could not
> mount lustre.  Continuing discussion here so others on list can
> follow/benefit.)
>     >
>     > Yu,
>     >
>     > For the IPoIB addresses used on your nodes, what are the subnets (and
>     > netmasks) that you are using?  It looks like servers use 10.82.143.X and
>     > clients use 10.82.141.X.  If you are using a 255.255.0.0 netmask, you
>     > should be fine.  But if you are using 255.255.255.0, then you will run
>     > into problems.  Lustre expects that all nodes on the same lnet network
>     > (o2ib1 in your case) will also be on the same IP subnet.
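>     >
>     > As an illustration, if the mask were 255.255.255.224 (/27), the network
>     > part is the top 27 bits of each address:
>     >
>     >     10.82.143.202 AND 255.255.255.224  =>  network 10.82.143.192
>     >     10.82.141.17  AND 255.255.255.224  =>  network 10.82.141.0
>     >
>     > so the servers and clients would be on different IP subnets even though
>     > both sides are configured as o2ib1.  (The client address 10.82.141.17 is
>     > just an example.)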
>     >
>     > Have you tried running a regular "ping <IPoIB_address>" command between
>     > clients and servers to make sure that part is working?
>     >
>     > --
>     > Rick Mohr
>     > Senior HPC System Administrator
>     > National Institute for Computational Sciences
>     > http://www.nics.tennessee.edu
>     >
>     > _______________________________________________
>     > lustre-discuss mailing list
>     > lustre-discuss at lists.lustre.org
>     > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>     Cheers, Andreas
>     ---
>     Andreas Dilger
>     Principal Lustre Architect
>     Whamcloud
>
> ------------------------------
>
> End of lustre-discuss Digest, Vol 147, Issue 43
> ***********************************************
>

