[lustre-discuss] lustre-discuss Digest, Vol 136, Issue 26
Ms. Megan Larko
dobsonunit at gmail.com
Fri Jul 28 10:49:29 PDT 2017
Subject: LNET router (2.10.0) recommendations for heterogeneous (mlx5,
qib) IB setup
Greetings!
I did not see an answer to the question posed in the subject line above
about heterogeneous IB environments, so I thought I would chime in.
One document I have found on the topic of heterogeneous IB environments is
http://wiki.lustre.org/Infiniband_Configuration_Howto
Generally speaking, networks like to be as homogeneous as possible. That
said, they may not always be. If you are working with mlx5, you may wish
to look over LU-7124 and LU-1701 regarding the setting of peer_credits.
In Lustre versions prior to 2.9.0, the mlx5 driver did not handle
peer_credits > 16 unless map_on_demand was set to 256 (which is the
default in newer versions of Lustre, I believe).
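
If you need to raise those values, the usual knobs are the ko2iblnd module
options; a minimal sketch (the numbers below are only illustrative, not a
tested recommendation for a qib/mlx5 mix):

  # /etc/modprobe.d/ko2iblnd.conf
  options ko2iblnd peer_credits=32 map_on_demand=256 concurrent_sends=32

  # reload LNET and confirm what the LND actually picked up
  lustre_rmmod
  modprobe lnet
  lnetctl lnet configure
  lnetctl net show -v

On 2.10 the same tunables can also be set per-NI in /etc/lnet.conf (see the
config further down in this digest), and "lnetctl stats show" is a quick way
to check whether messages are being dropped while you experiment.
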
Cheers,
megan
On Tue, Jul 25, 2017 at 4:11 PM, <lustre-discuss-request at lists.lustre.org>
wrote:
> Send lustre-discuss mailing list submissions to
> lustre-discuss at lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-request at lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-owner at lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: Install issues on 2.10.0 (John Casu)
> 2. How does Lustre client side caching work? (Joakim Ziegler)
> 3. LNET router (2.10.0) recommendations for heterogeneous (mlx5,
> qib) IB setup (Nathan R.M. Crawford)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 25 Jul 2017 10:52:06 -0700
> From: John Casu <john at chiraldynamics.com>
> To: "Mannthey, Keith" <keith.mannthey at intel.com>, Ben Evans
> <bevans at cray.com>, "lustre-discuss at lists.lustre.org"
> <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] Install issues on 2.10.0
> Message-ID: <96d20a1a-9c15-167d-3538-50721f7872e8 at chiraldynamics.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Ok, so I assume this is actually a ZFS/SPL bug and not a Lustre bug.
> Also, thanks for the pointer, Ben.
>
> many thanks,
> -john
>
> On 7/25/17 10:19 AM, Mannthey, Keith wrote:
> > Host_id is for zpool double-import protection. If a host id is set on a
> > zpool (ZFS does this automatically), then an HA server can't just import the
> > pool (users have to use --force). This makes the system a lot safer from
> > double zpool imports. Call 'genhostid' on your Lustre servers and the
> > warning will go away.
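> >
> > For example (the sysfs check is just a sanity test; the exact path may
> > vary with your SPL/ZFS release):
> >
> >   genhostid      # writes a random, persistent host id to /etc/hostid
> >   hostid         # should now print a non-zero value
> >   cat /sys/module/spl/parameters/spl_hostid    # what the loaded SPL module sees (may need a reload)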
> >
> > Thanks,
> > Keith
> >
> >
> >
> > -----Original Message-----
> > From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org]
> On Behalf Of Ben Evans
> > Sent: Tuesday, July 25, 2017 10:13 AM
> > To: John Casu <john at chiraldynamics.com>; lustre-discuss at lists.lustre.org
> > Subject: Re: [lustre-discuss] Install issues on 2.10.0
> >
> > health_check moved to /sys/fs/lustre/ along with a bunch of other things.
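> >
> > For example (assuming the lustre modules are loaded):
> >
> >   cat /sys/fs/lustre/health_check
> >   lctl get_param health_check    # lctl resolves the new location for you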
> >
> > -Ben
> >
> > On 7/25/17, 12:21 PM, "lustre-discuss on behalf of John Casu"
> > <lustre-discuss-bounces at lists.lustre.org on behalf of
> john at chiraldynamics.com> wrote:
> >
> >> Just installed latest 2.10.0 Lustre over ZFS on a vanilla Centos
> >> 7.3.1611 system, using dkms.
> >> ZFS is 0.6.5.11 from zfsonlinux.org, installed w. yum
> >>
> >> Not a single problem during installation, but I am having issues
> >> building a Lustre filesystem:
> >> 1. Building a separate mgt doesn't seem to work properly, although the
> >>    mgt/mdt combo seems to work just fine.
> >> 2. I get spl_hostid not set warnings, which I've never seen before.
> >> 3. /proc/fs/lustre/health_check seems to be missing.
> >>
> >> thanks,
> >> -john c
> >>
> >>
> >>
> >> ---------
> >> Building an mgt by itself doesn't seem to work properly:
> >>
> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs
> >>> --force-nohostid --servicenode=192.168.98.113 at tcp \
> >>> --backfstype=zfs mgs/mgt
> >>>
> >>> Permanent disk data:
> >>> Target: MGS
> >>> Index: unassigned
> >>> Lustre FS:
> >>> Mount type: zfs
> >>> Flags: 0x1064 (MGS first_time update no_primnode )
> >>> Persistent mount opts:
> >>> Parameters: failover.node=192.168.98.113 at tcp
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgs/mgt
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> Writing mgs/mgt properties
> >>> lustre:failover.node=192.168.98.113 at tcp
> >>> lustre:version=1
> >>> lustre:flags=4196
> >>> lustre:index=65535
> >>> lustre:svname=MGS
> >>> [root at fb-lts-mds0 x86_64]# mount.lustre mgs/mgt /mnt/mgs
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>>
> >>> mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'
> >>
> >> If I build the combo mgt/mdt, things go a lot better:
> >>
> >>>
> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs --mdt
> >>> --force-nohostid --servicenode=192.168.98.113 at tcp --backfstype=zfs
> >>> --index=0 --fsname=test meta/meta
> >>>
> >>> Permanent disk data:
> >>> Target: test:MDT0000
> >>> Index: 0
> >>> Lustre FS: test
> >>> Mount type: zfs
> >>> Flags: 0x1065 (MDT MGS first_time update no_primnode )
> >>> Persistent mount opts:
> >>> Parameters: failover.node=192.168.98.113 at tcp
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa meta/meta
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> Writing meta/meta properties
> >>> lustre:failover.node=192.168.98.113 at tcp
> >>> lustre:version=1
> >>> lustre:flags=4197
> >>> lustre:index=0
> >>> lustre:fsname=test
> >>> lustre:svname=test:MDT0000
> >>> [root at fb-lts-mds0 x86_64]# mount.lustre meta/meta /mnt/meta
> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
> >>> [root at fb-lts-mds0 x86_64]# df
> >>> Filesystem 1K-blocks Used Available Use% Mounted on
> >>> /dev/mapper/cl-root 52403200 3107560 49295640 6% /
> >>> devtmpfs 28709656 0 28709656 0% /dev
> >>> tmpfs 28720660 0 28720660 0% /dev/shm
> >>> tmpfs 28720660 17384 28703276 1% /run
> >>> tmpfs 28720660 0 28720660 0% /sys/fs/cgroup
> >>> /dev/sdb1 1038336 195484 842852 19% /boot
> >>> /dev/mapper/cl-home 34418260 32944 34385316 1% /home
> >>> tmpfs 5744132 0 5744132 0% /run/user/0
> >>> meta 60435328 128 60435200 1% /meta
> >>> meta/meta 59968128 4992 59961088 1% /mnt/meta
> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/mdt/test-MDT0000/
> >>> async_commit_count     hash_stats               identity_upcall       num_exports         sync_count
> >>> commit_on_sharing      hsm                      instance              recovery_status     sync_lock_cancel
> >>> enable_remote_dir      hsm_control              ir_factor             recovery_time_hard  uuid
> >>> enable_remote_dir_gid  identity_acquire_expire  job_cleanup_interval  recovery_time_soft
> >>> evict_client           identity_expire          job_stats             rename_stats
> >>> evict_tgt_nids         identity_flush           md_stats              root_squash
> >>> exports                identity_info            nosquash_nids         site_stats
> >>
> >> Also, there's no /proc/fs/lustre/health_check
> >>
> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/
> >>> fld llite lod lwp mdd mdt mgs osc osp seq
> >>> ldlm lmv lov mdc mds mgc nodemap osd-zfs qmt sptlrpc
> >>
> >>
> >>
> >>
> >
> >
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 25 Jul 2017 13:09:52 -0500
> From: Joakim Ziegler <joakim at terminalmx.com>
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] How does Lustre client side caching work?
> Message-ID:
> <CABkNrDaecBOJSpUtYJ-Fz5+-8NBB4QqRWtCZEHSPP0y=1=befQ@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello, I'm pretty new to Lustre, we're looking at setting up a Lustre
> cluster for storage of media assets (something in the 0.5-1PB range to
> start with, maybe 6 OSSes (in HA pairs), running on our existing FDR IB
> network). It looks like a good match for our needs; however, there's an
> area I've been unable to find details about. Note that I'm just
> investigating for now and have no running Lustre setup.
>
> There are plenty of references to Lustre using client side caching, and how
> the Distributed Lock Manager makes this work. However, I can find almost
> no information about how the client-side cache actually works. When I
> first heard it mentioned, I imagined something like the ZFS L2ARC, where
> you can add a device (say, a couple of SSDs) to the client and point Lustre
> at it to use it for caching. But some references I come across just talk
> about the normal kernel page cache, which is probably smaller and less
> persistent than what I'd like for our usage.
>
> Could anyone enlighten me? I have a large dataset, but clients typically
> use a small part of it at any given time, and use it quite intensively, so
> a client-side cache (either a read cache or ideally a writeback cache)
> would likely reduce network traffic and server load quite a bit. We've been
> using NFS over RDMA and fscache to get a read cache that does roughly this
> so far on our existing file servers, and it's been quite effective, so I
> imagine we could also benefit from something similar as we move to Lustre.
>
> --
> Joakim Ziegler - Supervisor de postproducción - Terminal
> joakim at terminalmx.com - 044 55 2971 8514 - 5264 0864
>
> ------------------------------
>
> Message: 3
> Date: Tue, 25 Jul 2017 11:52:43 -0700
> From: "Nathan R.M. Crawford" <nrcrawfo at uci.edu>
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] LNET router (2.10.0) recommendations for
> heterogeneous (mlx5, qib) IB setup
> Message-ID:
> <CAO6--ykDsL-VyqJK6=gPHdAY1psmx=mb-C25BeVD--
> cpztSWFQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi All,
>
> We are gradually updating a cluster (OS, etc.) in-place, basically
> switching blocks of nodes from the old head node to the new. Until we can
> re-arrange the fabric at the next scheduled machine room power shutdown
> event, we are running two independent Infiniband subnets. As I can't find
> useful documentation on proper IB routing between subnets, I have
> configured one node with an HCA on each IB subnet that does simple IPoIB
> routing and LNET routing.
>
> Brief description:
> The router node has 24 cores and 128 GB of RAM, and is running the in-kernel
> IB drivers from CentOS 7.3. It connects to the new IB fabric via a Mellanox EDR
> card (MT4115) on ib0, and to the old via a Truescale QDR card (QLE7340) on
> ib1. The old IB is on 10.2.0.0/16 (o2ib0), and the new is 10.201.32.0/19
> (o2ib1).
>
> The new 2.10.0 server is on the EDR side, and the old 2.8.0 server is on
> the QDR side. Nodes with QDR HCAs already coexist with EDR nodes on the EDR
> subnet without problems.
>
> All Lustre config via /etc/lnet.conf:
> #####
> net:
>     - net type: o2ib1
>       local NI(s):
>         - nid: 10.201.32.11@o2ib1
>           interfaces:
>               0: ib0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 62
>               peer_buffer_credits: 512
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 256
>               concurrent_sends: 62
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>     - net type: o2ib0
>       local NI(s):
>         - nid: 10.2.1.22@o2ib0
>           interfaces:
>               0: ib1
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 512
>               credits: 1024
>           lnd tunables:
>               map_on_demand: 32
>               concurrent_sends: 16
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
> routing:
>     - small: 16384
>       large: 2048
>       enable: 1
> ####
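>
> The file is in the format that lnetctl import/export uses, so after edits it
> can be re-applied and checked with something like the following (a sketch;
> adjust for however your lnet service is wired up):
>
>   lnetctl lnet configure
>   lnetctl import < /etc/lnet.conf
>   lnetctl net show -v      # confirm the per-NI tunables actually took effect
>   lnetctl stats show       # watch drop/error counters while traffic is stalled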
>
> While the setup works, I had to drop peer_credits to 8 on the QDR side to
> avoid long periods of stalled traffic. It is probably going to be adequate
> for the remaining month before total shutdown and removal of routers, but I
> would still like to have a better solution in hand.
>
> Questions:
> 1) Is there a well-known good config for a qib<-->mlx5 LNET router?
> 2) Where should I look to identify the cause of stalled traffic, which
> still appears at higher load?
> 3) What parameters should I be playing with to optimize the router?
>
> Thanks,
> Nate
>
>
>
> --
>
> Dr. Nathan Crawford nathan.crawford at uci.edu
> Modeling Facility Director
> Department of Chemistry
> 1102 Natural Sciences II Office: 2101 Natural Sciences II
> University of California, Irvine Phone: 949-824-4508
> Irvine, CA 92697-2025, USA
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
> ------------------------------
>
> End of lustre-discuss Digest, Vol 136, Issue 26
> ***********************************************
>