[lustre-discuss] lustre mount in heterogeneous net environment

Ms. Megan Larko dobsonunit at gmail.com
Tue Feb 27 12:44:08 PST 2018


Hello Jeff,

Yes, I can successfully run "lctl ping" from the client to the Lustre
server and vice versa as you described in:

   - Client on ib0 lnet can `lctl ping ip.of.mds.server at tcp0`
   - MDS on tcp0 can `lctl ping ip.of.client at o2ib`

I have not yet run iperf or an lnet selftest (lst). I can start on that
now.

Thank you,

megan

On Tue, Feb 27, 2018 at 3:37 PM, Jeff Johnson <
jeff.johnson at aeoncomputing.com> wrote:

> Megan,
>
> I assume by being able to ping from server and client you mean they can
> ping each other.
>
>    - Client on ib0 lnet can `lctl ping ip.of.mds.server at tcp0`
>    - MDS on tcp0 can `lctl ping ip.of.client at o2ib`
>
> If so, can you verify sustained throughput from each end to the LNet router?
> On the ethernet side you can run iperf (iperf2 or iperf3) to verify
> sustained, stable throughput. On the IB side you can use ib_send_bw from the
> LNet router to the client in a similar way.
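> For example, a rough sketch (the IPs/hostnames and the mlx4_0 device name
> below are placeholders, not your real values):
>
> # TCP/ethernet side: start an iperf3 server on the lnet router...
> iperf3 -s
> # ...and run the iperf3 client on the MDS against it for 30 seconds
> iperf3 -c router.eth.ip -t 30
>
> # IB side: start ib_send_bw listening on the Lustre client...
> ib_send_bw -d mlx4_0
> # ...then run ib_send_bw on the lnet router, pointed at the client
> ib_send_bw -d mlx4_0 client.ib.ip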
>
> Additionally, you can run lnet_selftest between the MDS and the client. This
> tests the LNet layer only; if the ethernet or IB layers beneath it are wonky,
> lnet_selftest will not be able to tell you why, only that something is wrong.
>
> The lnet_selftest method:
>
>    1. On both the MDS and the client, run `modprobe lnet_selftest`
>    2. Save the script below on the MDS
>    3. On the MDS, run `export LST_SESSION=41704170`
>    4. Run the script.
>
> # lnet_selftest script -- run on the MDS
>
> conc=8
> export LST_SESSION=41704170
> # open a new test session
> lst new_session rw
> # define the endpoint groups (replace the placeholders with the real NIDs)
> lst add_group clients clients.ip.addr@o2ib
> lst add_group servers mds.ip.addr@tcp
> # create a batch with a bulk read test from the clients group to the servers group
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --distribute 1:1 --concurrency ${conc} --from clients --to servers brw read size=1M
> lst run bulk_rw
> # report live performance stats for both groups
> lst stat clients servers
>
>
> You will see performance stats reported for server and client. To stop, press
> ctrl-c and then type `lst end_session`.
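> For example, something along these lines on the MDS (using the same
> LST_SESSION value as in the script) tears the test down cleanly:
>
> export LST_SESSION=41704170
> lst stop bulk_rw       # stop the running batch
> lst end_session        # end the session and clean up the background processes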
>
> The value of LST_SESSION is arbitrary but lnet_selftest needs it so the
> background processes can be killed when the benchmark ends.
>
> If lnet_selftest fails then there is something wonky in the routing or the
> network layer (non-lustre) underneath it.
>
> Make sense?
>
> --Jeff
>
>
>
>
> On Tue, Feb 27, 2018 at 12:08 PM, Ms. Megan Larko <dobsonunit at gmail.com>
> wrote:
>
>> Hello List!
>>
>> We have some Lustre 2.7.18 servers using TCP.  Through some dual-homed
>> Lustre LNet routers we would like to connect some Mellanox (mlx4) InfiniBand
>> Lustre 2.7.0 clients.
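>> For reference, the LNet module configuration we expect this topology to need
>> looks roughly like the following (the interface names and router addresses
>> are placeholders, not our real values):
>>
>> # IB client (/etc/modprobe.d/lustre.conf): reach tcp0 via the router's o2ib NID
>> options lnet networks="o2ib(ib0)" routes="tcp0 router.ib.ip@o2ib"
>> # TCP server: reach o2ib via the router's tcp0 NID
>> options lnet networks="tcp0(eth0)" routes="o2ib router.eth.ip@tcp0"
>> # dual-homed LNet router: both networks, with forwarding enabled
>> options lnet networks="tcp0(eth0),o2ib(ib0)" forwarding="enabled"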
>>
>> The "lctl ping" command works from both the server co-located MGS/MDS and
>> from the client.
>> The mount of the TCP lustre server share from the IB client starts and
>> then shortly thereafter fails with "Input/output error    Is the MGS
>> running?"
>>
>> At roughly 20-minute intervals after the client mount request, the Lustre
>> MDS /var/log/messages reports:
>> Lustre: MGS: Client <string> (at A.B.C.D at o2ib) reconnecting
>>
>> The IB client mount command:
>> mount -t lustre C.D.E.F at tcp0:/lustre /mnt/lustre
>>
>> The command waits about a minute and then returns:
>> mount.lustre C.D.E.F at tcp0:/lustre at /mnt/lustre failed:  Input/output
>> error
>> Is the MGS running?
>>
>> The IB client /var/log/messages file contains:
>> Lustre: client.c:19349:ptlrpc_expire_one_request() @@@ Request sent
>> has timed out for slow reply ...... -->MGCC.D.E.F at tcp was lost; in
>> progress operations using this service will fail
>> LustreError: 15c-8: MGCC.D.E.F at tcp: The configuration from log
>> 'lustre-client' failed (-5)  This may be the result of communication errors
>> between this node and the MGS, a bad configuration, or other errors.  See
>> the syslog for more information.
>> Lustre: MGCC.D.E.F at tcp: Connection restored to MGS (at C.D.E.F at tcp)
>> Lustre: Unmounted lustre-client
>> LustreError: 22939:0:(obd_mount.c:lustre_fill_super()) Unable to mount
>> (-5)
>>
>> We have not (yet) set any non-default values on the Lustre File System.
>> *  Server: Lustre 2.7.18  CentOS Linux release 7.3.1611 (Core)  kernel
>> 3.10.0-514.2.2.el7_lustre.x86_64   The server is ethernet; no IB.
>>
>> *  Client: Lustre-2.7.0  RHEL 6.8  kernel 2.6.32-696.3.2.el6.x86_64
>> The client uses Mellanox InfiniBand mlx4.
>>
>> The mount point does exist on the client.  The firewall has been checked and
>> is not an issue.  SELinux is disabled.
>>
>> NOTE: The server does serve the same /lustre file system to other TCP
>> Lustre clients.
>> The client does mount other Lustre file systems (/lustre_mnt) from other IB
>> servers.
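>> In case it helps, this is roughly how we have been checking the routing on
>> each end (router.ib.ip / router.eth.ip again stand in for the LNet router's
>> NIDs):
>>
>> # on the IB client and on the TCP server: show local NIDs and configured routes
>> lctl list_nids
>> lctl route_list
>> # from each end, ping the router's near-side NID
>> lctl ping router.ib.ip@o2ib      # from the client
>> lctl ping router.eth.ip@tcp0     # from the server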
>>
>> The info on
>> http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes
>> describes a situation exceedingly similar to ours.  I'm not sure what
>> Lustre settings to check, since I have not explicitly set any to be
>> different from the default values.
>>
>> Any hints would be genuinely appreciated.
>> Cheers,
>> megan
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>>
>
>
> --
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.johnson at aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>