[lustre-discuss] lustre-discuss Digest, Vol 136, Issue 26
Ms. Megan Larko
dobsonunit at gmail.com
Fri Jul 28 10:49:29 PDT 2017
Subject: LNET router (2.10.0) recommendations for heterogeneous (mlx5,
qib) IB setup
Greetings!
I did not see an answer to the question posed in the subject line above
about heterogeneous IB environments, so I thought I would chime in.
One document I have found on the topic of heterogeneous IB environments is
http://wiki.lustre.org/Infiniband_Configuration_Howto
Generally speaking, networks like to be as homogeneous as possible. That
said, they may not always be. If you are working with mlx5, you may wish
to look over LU-7124 and LU-1701 regarding the setting of peer_credits.
In Lustre versions prior to 2.9.0, the mlx5 driver did not handle
peer_credits > 16 unless map_on_demand was set to 256 (which is the
default in newer versions of Lustre, I believe).
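
If you need to raise those values, the usual knobs are the ko2iblnd module
options; a minimal sketch (the numbers below are only illustrative, not a
tested recommendation for a qib/mlx5 mix):

  # /etc/modprobe.d/ko2iblnd.conf
  options ko2iblnd peer_credits=32 map_on_demand=256 concurrent_sends=32

  # reload LNET and confirm what the LND actually picked up
  lustre_rmmod
  modprobe lnet
  lnetctl lnet configure
  lnetctl net show -v

On 2.10 the same tunables can also be set per-NI in /etc/lnet.conf (see the
config further down in this digest), and "lnetctl stats show" is a quick way
to check whether messages are being dropped while you experiment.
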
Cheers,
megan
On Tue, Jul 25, 2017 at 4:11 PM, <lustre-discuss-request at lists.lustre.org>
wrote:
> Send lustre-discuss mailing list submissions to
> lustre-discuss at lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-request at lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-owner at lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: Install issues on 2.10.0 (John Casu)
> 2. How does Lustre client side caching work? (Joakim Ziegler)
> 3. LNET router (2.10.0) recommendations for heterogeneous (mlx5,
> qib) IB setup (Nathan R.M. Crawford)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 25 Jul 2017 10:52:06 -0700
> From: John Casu <john at chiraldynamics.com>
> To: "Mannthey, Keith" <keith.mannthey at intel.com>, Ben Evans
> <bevans at cray.com>, "lustre-discuss at lists.lustre.org"
> <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] Install issues on 2.10.0
> Message-ID: <96d20a1a-9c15-167d-3538-50721f7872e8 at chiraldynamics.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Ok, so I assume this is actually a ZFS/SPL bug and not a Lustre bug.
> Also, thanks for the pointer, Ben.
>
> many thanks,
> -john
>
> On 7/25/17 10:19 AM, Mannthey, Keith wrote:
> > Host_id is for zpool double-import protection. If a host id is set on a
> > zpool (ZFS does this automatically), then an HA server can't just import the
> > pool (users have to use --force). This makes the system a lot safer from
> > double zpool imports. Call 'genhostid' on your Lustre servers and the
> > warning will go away.
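> >
> > For example (the sysfs check is just a sanity test; the exact path may
> > vary with your SPL/ZFS release):
> >
> >   genhostid      # writes a random, persistent host id to /etc/hostid
> >   hostid         # should now print a non-zero value
> >   cat /sys/module/spl/parameters/spl_hostid    # what the loaded SPL module sees (may need a reload)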
> >
> > Thanks,
> > Keith
> >
> >
> >
> > -----Original Message-----
> > From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org]
> On Behalf Of Ben Evans
> > Sent: Tuesday, July 25, 2017 10:13 AM
> > To: John Casu <john at chiraldynamics.com>; lustre-discuss at lists.lustre.org
> > Subject: Re: [lustre-discuss] Install issues on 2.10.0
> >
> > health_check moved to /sys/fs/lustre/ along with a bunch of other things.
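> >
> > For example (assuming the lustre modules are loaded):
> >
> >   cat /sys/fs/lustre/health_check
> >   lctl get_param health_check    # lctl resolves the new location for you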
> >
> > -Ben
> >
> > On 7/25/17, 12:21 PM, "lustre-discuss on behalf of John Casu"
> > <lustre-discuss-bounces at lists.lustre.org on behalf of
> john at chiraldynamics.com> wrote:
> >
> >> Just installed latest 2.10.0 Lustre over ZFS on a vanilla Centos
> >> 7.3.1611 system, using dkms.
> >> ZFS is 0.6.5.11 from zfsonlinux.org, installed w. yum
> >>
> >> Not a single problem during installation, but I am having issues
> >> building a Lustre filesystem:
> >> 1. Building a separate mgt doesn't seem to work properly, although the
> >>    mgt/mdt combo seems to work just fine.
> >> 2. I get spl_hostid not set warnings, which I've never seen before.
> >> 3. /proc/fs/lustre/health_check seems to be missing.
> >>
> >> thanks,
> >> -john c
> >>
> >>
> >>
> >> ---------
> >> Building an mgt by itself doesn't seem to work properly:
> >>
> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs
> >>> --force-nohostid --servicenode=192.168.98.113 at tcp \
> >>> --backfstype=zfs mgs/mgt
> >>>
> >>> Permanent disk data:
> >>> Target: MGS
> >>> Index: unassigned
> >>> Lustre FS:
> >>> Mount type: zfs
> >>> Flags: 0x1064 (MGS first_time update no_primnode )
> >>> Persistent mount opts:
> >>> Parameters: failover.node=192.168.98.113 at tcp
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgs/mgt
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> Writing mgs/mgt properties
> >>> lustre:failover.node=192.168.98.113 at tcp
> >>> lustre:version=1
> >>> lustre:flags=4196
> >>> lustre:index=65535
> >>> lustre:svname=MGS
> >>> [root at fb-lts-mds0 x86_64]# mount.lustre mgs/mgt /mnt/mgs
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>>
> >>> mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'
> >>
> >> If I build the combo mgt/mdt, things go a lot better:
> >>
> >>>
> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs --mdt
> >>> --force-nohostid --servicenode=192.168.98.113 at tcp --backfstype=zfs
> >>> --index=0 --fsname=test meta/meta
> >>>
> >>> Permanent disk data:
> >>> Target: test:MDT0000
> >>> Index: 0
> >>> Lustre FS: test
> >>> Mount type: zfs
> >>> Flags: 0x1065 (MDT MGS first_time update no_primnode )
> >>> Persistent mount opts:
> >>> Parameters: failover.node=192.168.98.113 at tcp
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa meta/meta
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> Writing meta/meta properties
> >>> lustre:failover.node=192.168.98.113 at tcp
> >>> lustre:version=1
> >>> lustre:flags=4197
> >>> lustre:index=0
> >>> lustre:fsname=test
> >>> lustre:svname=test:MDT0000
> >>> [root at fb-lts-mds0 x86_64]# mount.lustre meta/meta /mnt/meta
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> [root at fb-lts-mds0 x86_64]# df
> >>> Filesystem 1K-blocks Used Available Use% Mounted on
> >>> /dev/mapper/cl-root 52403200 3107560 49295640 6% /
> >>> devtmpfs 28709656 0 28709656 0% /dev
> >>> tmpfs 28720660 0 28720660 0% /dev/shm
> >>> tmpfs 28720660 17384 28703276 1% /run
> >>> tmpfs 28720660 0 28720660 0% /sys/fs/cgroup
> >>> /dev/sdb1 1038336 195484 842852 19% /boot
> >>> /dev/mapper/cl-home 34418260 32944 34385316 1% /home
> >>> tmpfs 5744132 0 5744132 0% /run/user/0
> >>> meta 60435328 128 60435200 1% /meta
> >>> meta/meta 59968128 4992 59961088 1% /mnt/meta
> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/mdt/test-MDT0000/
> >>> async_commit_count     hash_stats               identity_upcall       num_exports         sync_count
> >>> commit_on_sharing      hsm                      instance              recovery_status     sync_lock_cancel
> >>> enable_remote_dir      hsm_control              ir_factor             recovery_time_hard  uuid
> >>> enable_remote_dir_gid  identity_acquire_expire  job_cleanup_interval  recovery_time_soft
> >>> evict_client           identity_expire          job_stats             rename_stats
> >>> evict_tgt_nids         identity_flush           md_stats              root_squash
> >>> exports                identity_info            nosquash_nids         site_stats
> >>
> >> Also, there's no /proc/fs/lustre/health_check
> >>
> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/
> >>> fld llite lod lwp mdd mdt mgs osc osp seq
> >>> ldlm lmv lov mdc mds mgc nodemap osd-zfs qmt sptlrpc
> >>
> >>
> >>
> >>
> >
> >
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 25 Jul 2017 13:09:52 -0500
> From: Joakim Ziegler <joakim at terminalmx.com>
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] How does Lustre client side caching work?
> Message-ID:
> <CABkNrDaecBOJSpUtYJ-Fz5+-8NBB4QqRWtCZEHSPP0y=1=befQ@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello, I'm pretty new to Lustre, we're looking at setting up a Lustre
> cluster for storage of media assets (something in the 0.5-1PB range to
> start with, maybe 6 OSSes (in HA pairs), running on our existing FDR IB
> network). It looks like a good match for our needs; however, there's an
> area I've been unable to find details about. Note that I'm just
> investigating for now and have no running Lustre setup.
>
> There are plenty of references to Lustre using client side caching, and how
> the Distributed Lock Manager makes this work. However, I can find almost
> no information about how the client-side cache actually works. When I
> first heard it mentioned, I imagined something like the ZFS L2ARC, where
> you can add a device (say, a couple of SSDs) to the client and point Lustre
> at it to use it for caching. But some references I come across just talk
> about the normal kernel page cache, which is probably smaller and less
> persistent than what I'd like for our usage.
>
> Could anyone enlighten me? I have a large dataset, but clients typically
> use a small part of it at any given time, and use it quite intensively, so
> a client-side cache (either a read cache or ideally a writeback cache)
> would likely reduce network traffic and server load quite a bit. We've been
> using NFS over RDMA and fscache to get a read cache that does roughly this
> so far on our existing file servers, and it's been quite effective, so I
> imagine we could also benefit from something similar as we move to Lustre.
>
> --
> Joakim Ziegler - Supervisor de postproducción - Terminal
> joakim at terminalmx.com - 044 55 2971 8514 - 5264 0864
>
> ------------------------------
>
> Message: 3
> Date: Tue, 25 Jul 2017 11:52:43 -0700
> From: "Nathan R.M. Crawford" <nrcrawfo at uci.edu>
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] LNET router (2.10.0) recommendations for
> heterogeneous (mlx5, qib) IB setup
> Message-ID:
> <CAO6--ykDsL-VyqJK6=gPHdAY1psmx=mb-C25BeVD--
> cpztSWFQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi All,
>
> We are gradually updating a cluster (OS, etc.) in-place, basically
> switching blocks of nodes from the old head node to the new. Until we can
> re-arrange the fabric at the next scheduled machine room power shutdown
> event, we are running two independent Infiniband subnets. As I can't find
> useful documentation on proper IB routing between subnets, I have
> configured one node with an HCA on each IB subnet that does simple IPoIB
> routing and LNET routing.
>
> Brief description:
> The router node has 24 cores and 128 GB of RAM, and is running the in-kernel
> IB drivers from CentOS 7.3. It connects to the new IB fabric via a Mellanox EDR
> card (MT4115) on ib0, and to the old via a Truescale QDR card (QLE7340) on
> ib1. The old IB is on 10.2.0.0/16 (o2ib0), and the new is 10.201.32.0/19
> (o2ib1).
>
> The new 2.10.0 server is on the EDR side, and the old 2.8.0 server is on
> the QDR side. Nodes with QDR HCAs already coexist with EDR nodes on the EDR
> subnet without problems.
>
> All Lustre config via /etc/lnet.conf:
> #####
> net:
>     - net type: o2ib1
>       local NI(s):
>         - nid: 10.201.32.11@o2ib1
>           interfaces:
>               0: ib0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 62
>               peer_buffer_credits: 512
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 256
>               concurrent_sends: 62
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>     - net type: o2ib0
>       local NI(s):
>         - nid: 10.2.1.22@o2ib0
>           interfaces:
>               0: ib1
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 512
>               credits: 1024
>           lnd tunables:
>               map_on_demand: 32
>               concurrent_sends: 16
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
> routing:
>     - small: 16384
>       large: 2048
>       enable: 1
> ####
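>
> The file is in the format that lnetctl import/export uses, so after edits it
> can be re-applied and checked with something like the following (a sketch;
> adjust for however your lnet service is wired up):
>
>   lnetctl lnet configure
>   lnetctl import < /etc/lnet.conf
>   lnetctl net show -v      # confirm the per-NI tunables actually took effect
>   lnetctl stats show       # watch drop/error counters while traffic is stalled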
>
> While the setup works, I had to drop peer_credits to 8 on the QDR side to
> avoid long periods of stalled traffic. It is probably going to be adequate
> for the remaining month before total shutdown and removal of routers, but I
> would still like to have a better solution in hand.
>
> Questions:
> 1) Is there a well-known good config for a qib<-->mlx5 LNET router?
> 2) Where should I look to identify the cause of stalled traffic, which
> still appears at higher load?
> 3) What parameters should I be playing with to optimize the router?
>
> Thanks,
> Nate
>
>
>
> --
>
> Dr. Nathan Crawford nathan.crawford at uci.edu
> Modeling Facility Director
> Department of Chemistry
> 1102 Natural Sciences II Office: 2101 Natural Sciences II
> University of California, Irvine Phone: 949-824-4508
> Irvine, CA 92697-2025, USA
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
> ------------------------------
>
> End of lustre-discuss Digest, Vol 136, Issue 26
> ***********************************************
>