[lustre-discuss] lustre-discuss Digest, Vol 136, Issue 26
Nathan R.M. Crawford
nrcrawfo at uci.edu
Mon Jul 31 11:27:52 PDT 2017
Hi Megan,
I had actually seen those recommendations (and others in the Jira
discussions), and used them to get the non-routed mlx5/qib parameters
working. I think the periods of stalled/delayed routed traffic at high
network load might happen even if the IB hardware were homogeneous; they
just would not degrade as quickly.
Probably a better question: Are there 2.10-updated recommended procedures
for diagnosing IB LNET routing performance issues?
My specific problems should go away in a few weeks when we unify the IB
fabric. However, a solution will help others in the future.
Thanks,
Nate
On Fri, Jul 28, 2017 at 10:49 AM, Ms. Megan Larko <dobsonunit at gmail.com>
wrote:
> Subject: LNET router (2.10.0) recommendations for heterogeneous (mlx5,
> qib) IB setup
>
> Greetings!
>
> I did not see an answer to the question posed in the subject line above
> about heterogeneous IB environments, so I thought I would chime in.
>
> One document I have found on the topic of heterogeneous IB environments is
> http://wiki.lustre.org/Infiniband_Configuration_Howto
>
> Generally speaking, networks like to be as homogeneous as possible. That
> said, they may not always be such. If you are working with mlx5, you may
> wish to look over LU-7124 and LU-1701 regarding the setting of
> peer_credits. In Lustre versions prior to 2.9.0, the mlx5 driver did not
> handle peer_credits > 16 unless map_on_demand was set to 256 (which is
> the default in newer versions of Lustre, I believe).
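[Editorial note: the peer_credits and map_on_demand values discussed above are ko2iblnd kernel module parameters. A sketch of setting them on an mlx5 client follows; the specific values are illustrative assumptions, not recommendations, and should be tuned per fabric.]

```shell
# /etc/modprobe.d/ko2iblnd.conf -- illustrative values only
# Per LU-7124/LU-1701, peer_credits > 16 on mlx5 reportedly needs
# map_on_demand=256 on Lustre releases before 2.9.0.
options ko2iblnd peer_credits=32 peer_credits_hiw=16 map_on_demand=256 concurrent_sends=64
```

The module options take effect when ko2iblnd is next loaded (e.g. after a reboot or an unload/reload of the LNET stack).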
>
> Cheers,
> megan
>
> On Tue, Jul 25, 2017 at 4:11 PM, <lustre-discuss-request at lists.lustre.org>
> wrote:
>
>> Send lustre-discuss mailing list submissions to
>> lustre-discuss at lists.lustre.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> or, via email, send a message with subject or body 'help' to
>> lustre-discuss-request at lists.lustre.org
>>
>> You can reach the person managing the list at
>> lustre-discuss-owner at lists.lustre.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of lustre-discuss digest..."
>>
>>
>> Today's Topics:
>>
>> 1. Re: Install issues on 2.10.0 (John Casu)
>> 2. How does Lustre client side caching work? (Joakim Ziegler)
>> 3. LNET router (2.10.0) recommendations for heterogeneous (mlx5,
>> qib) IB setup (Nathan R.M. Crawford)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 25 Jul 2017 10:52:06 -0700
>> From: John Casu <john at chiraldynamics.com>
>> To: "Mannthey, Keith" <keith.mannthey at intel.com>, Ben Evans
>> <bevans at cray.com>, "lustre-discuss at lists.lustre.org"
>> <lustre-discuss at lists.lustre.org>
>> Subject: Re: [lustre-discuss] Install issues on 2.10.0
>> Message-ID: <96d20a1a-9c15-167d-3538-50721f7872e8 at chiraldynamics.com>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Ok, so I assume this is actually a ZFS/SPL bug & not a Lustre bug.
>> Also, thanks Ben, for the pointer.
>>
>> many thanks,
>> -john
>>
>> On 7/25/17 10:19 AM, Mannthey, Keith wrote:
>> > The host ID is for zpool double-import protection. If a host ID is set
>> on a zpool (ZFS does this automatically), an HA server can't simply
>> import the pool (users have to use --force). This makes the system a lot
>> safer from double zpool imports. Run 'genhostid' on your Lustre servers
>> and the warning will go away.
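[Editorial note: a sketch of the fix suggested above, assuming a CentOS-style server where the genhostid utility is available.]

```shell
# Generate a random host ID and persist it in /etc/hostid so that
# SPL/ZFS can stamp pools and refuse imports from foreign hosts.
genhostid

# Verify: hostid should now print a stable non-zero value.
hostid
```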
>> >
>> > Thanks,
>> > Keith
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org]
>> On Behalf Of Ben Evans
>> > Sent: Tuesday, July 25, 2017 10:13 AM
>> > To: John Casu <john at chiraldynamics.com>; lustre-discuss at lists.lustre.org
>> > Subject: Re: [lustre-discuss] Install issues on 2.10.0
>> >
>> > health_check moved to /sys/fs/lustre/ along with a bunch of other
>> > things.
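[Editorial note: a quick way to confirm the relocation described above. `lctl get_param` resolves the parameter path itself, so it works whether the file lives under /proc or /sys; verify against your build.]

```shell
# On 2.10 servers the file moved out of /proc/fs/lustre:
cat /sys/fs/lustre/health_check

# Path-independent alternative:
lctl get_param -n health_check
```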
>> >
>> > -Ben
>> >
>> > On 7/25/17, 12:21 PM, "lustre-discuss on behalf of John Casu"
>> > <lustre-discuss-bounces at lists.lustre.org on behalf of
>> john at chiraldynamics.com> wrote:
>> >
>> >> Just installed the latest 2.10.0 Lustre over ZFS on a vanilla CentOS
>> >> 7.3.1611 system, using dkms.
>> >> ZFS is 0.6.5.11 from zfsonlinux.org, installed with yum.
>> >>
>> >> Not a single problem during installation, but I am having issues
>> >> building a lustre filesystem:
>> >> 1. Building a separate mgt doesn't seem to work properly, although the
>> >> mgt/mdt combo seems to work just fine.
>> >> 2. I get spl_hostid not set warnings, which I've never seen before.
>> >> 3. /proc/fs/lustre/health_check seems to be missing.
>> >>
>> >> thanks,
>> >> -john c
>> >>
>> >>
>> >>
>> >> ---------
>> >> Building an mgt by itself doesn't seem to work properly:
>> >>
>> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs
>> >>> --force-nohostid --servicenode=192.168.98.113 at tcp \
>> >>> --backfstype=zfs mgs/mgt
>> >>>
>> >>> Permanent disk data:
>> >>> Target: MGS
>> >>> Index: unassigned
>> >>> Lustre FS:
>> >>> Mount type: zfs
>> >>> Flags: 0x1064
>> >>> (MGS first_time update no_primnode )
>> >>> Persistent mount opts:
>> >>> Parameters: failover.node=192.168.98.113 at tcp
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgs/mgt
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> Writing mgs/mgt properties
>> >>> lustre:failover.node=192.168.98.113 at tcp
>> >>> lustre:version=1
>> >>> lustre:flags=4196
>> >>> lustre:index=65535
>> >>> lustre:svname=MGS
>> >>> [root at fb-lts-mds0 x86_64]# mount.lustre mgs/mgt /mnt/mgs
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>>
>> >>> mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'
>> >>
>> >> If I build the combo mgt/mdt, things go a lot better:
>> >>
>> >>>
>> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs --mdt
>> >>> --force-nohostid --servicenode=192.168.98.113 at tcp --backfstype=zfs
>> >>> --index=0 --fsname=test meta/meta
>> >>>
>> >>> Permanent disk data:
>> >>> Target: test:MDT0000
>> >>> Index: 0
>> >>> Lustre FS: test
>> >>> Mount type: zfs
>> >>> Flags: 0x1065
>> >>> (MDT MGS first_time update no_primnode )
>> >>> Persistent mount opts:
>> >>> Parameters: failover.node=192.168.98.113 at tcp
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa meta/meta
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> Writing meta/meta properties
>> >>> lustre:failover.node=192.168.98.113 at tcp
>> >>> lustre:version=1
>> >>> lustre:flags=4197
>> >>> lustre:index=0
>> >>> lustre:fsname=test
>> >>> lustre:svname=test:MDT0000
>> >>> [root at fb-lts-mds0 x86_64]# mount.lustre meta/meta /mnt/meta
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> [root at fb-lts-mds0 x86_64]# df
>> >>> Filesystem 1K-blocks Used Available Use% Mounted on
>> >>> /dev/mapper/cl-root 52403200 3107560 49295640 6% /
>> >>> devtmpfs 28709656 0 28709656 0% /dev
>> >>> tmpfs 28720660 0 28720660 0% /dev/shm
>> >>> tmpfs 28720660 17384 28703276 1% /run
>> >>> tmpfs 28720660 0 28720660 0% /sys/fs/cgroup
>> >>> /dev/sdb1 1038336 195484 842852 19% /boot
>> >>> /dev/mapper/cl-home 34418260 32944 34385316 1% /home
>> >>> tmpfs 5744132 0 5744132 0% /run/user/0
>> >>> meta 60435328 128 60435200 1% /meta
>> >>> meta/meta 59968128 4992 59961088 1% /mnt/meta
>> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/mdt/test-MDT0000/
>> >>> async_commit_count hash_stats identity_upcall
>> >>> num_exports sync_count
>> >>> commit_on_sharing hsm instance
>> >>> recovery_status sync_lock_cancel
>> >>> enable_remote_dir hsm_control ir_factor
>> >>> recovery_time_hard uuid
>> >>> enable_remote_dir_gid identity_acquire_expire job_cleanup_interval
>> >>> recovery_time_soft
>> >>> evict_client identity_expire job_stats
>> >>> rename_stats
>> >>> evict_tgt_nids identity_flush md_stats
>> >>> root_squash
>> >>> exports identity_info nosquash_nids
>> >>> site_stats
>> >>
>> >> Also, there's no /proc/fs/lustre/health_check
>> >>
>> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/
>> >>> fld llite lod lwp mdd mdt mgs osc osp seq
>> >>> ldlm lmv lov mdc mds mgc nodemap osd-zfs qmt sptlrpc
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> lustre-discuss mailing list
>> >> lustre-discuss at lists.lustre.org
>> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> >
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Tue, 25 Jul 2017 13:09:52 -0500
>> From: Joakim Ziegler <joakim at terminalmx.com>
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] How does Lustre client side caching work?
>> Message-ID:
>> <CABkNrDaecBOJSpUtYJ-Fz5+-8NBB4QqRWtCZEHSPP0y=1=befQ at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello, I'm pretty new to Lustre. We're looking at setting up a Lustre
>> cluster for storage of media assets (something in the 0.5-1 PB range to
>> start with, maybe 6 OSSes in HA pairs, running on our existing FDR IB
>> network). It looks like a good match for our needs; however, there's an
>> area I've been unable to find details about. Note that I'm just
>> investigating for now; I have no running Lustre setup.
>>
>> There are plenty of references to Lustre using client-side caching, and
>> how the Distributed Lock Manager makes this work. However, I can find
>> almost no information about how the client-side cache actually works.
>> When I first heard it mentioned, I imagined something like the ZFS L2ARC,
>> where you can add a device (say, a couple of SSDs) to the client and
>> point Lustre at it to use it for caching. But some references I come
>> across just talk about the normal kernel page cache, which is probably
>> smaller and less persistent than what I'd like for our usage.
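[Editorial note: the cache the references describe is indeed the kernel page cache, bounded per client mount by an llite tunable. A sketch of inspecting and adjusting it follows; the parameter name is from current Lustre clients and the value is illustrative, so verify against your version.]

```shell
# Show the per-client cap (in MB) on cached pages for each mounted filesystem
lctl get_param llite.*.max_cached_mb

# Raise the cap on a client with plenty of RAM (illustrative value)
lctl set_param llite.*.max_cached_mb=32768
```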
>>
>> Could anyone enlighten me? I have a large dataset, but clients typically
>> use a small part of it at any given time, and use it quite intensively,
>> so a client-side cache (either a read cache or ideally a writeback cache)
>> would likely reduce network traffic and server load quite a bit. We've
>> been using NFS over RDMA and fscache to get a read cache that does
>> roughly this so far on our existing file servers, and it's been quite
>> effective, so I imagine we could also benefit from something similar as
>> we move to Lustre.
>>
>> --
>> Joakim Ziegler - Supervisor de postproducción - Terminal
>> joakim at terminalmx.com - 044 55 2971 8514 - 5264 0864
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Tue, 25 Jul 2017 11:52:43 -0700
>> From: "Nathan R.M. Crawford" <nrcrawfo at uci.edu>
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] LNET router (2.10.0) recommendations for
>> heterogeneous (mlx5, qib) IB setup
>> Message-ID:
>> <CAO6--ykDsL-VyqJK6=gPHdAY1psmx=mb-C25BeVD--cpztSWFQ at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi All,
>>
>> We are gradually updating a cluster (OS, etc.) in-place, basically
>> switching blocks of nodes from the old head node to the new. Until we can
>> re-arrange the fabric at the next scheduled machine room power shutdown
>> event, we are running two independent Infiniband subnets. As I can't find
>> useful documentation on proper IB routing between subnets, I have
>> configured one node with an HCA on each IB subnet that does simple IPoIB
>> routing and LNET routing.
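[Editorial note: the IPoIB-routing half of the setup described above can be sketched as follows. Interface names ib0/ib1 and the addresses are taken from the configuration later in this message; treat the commands as a sketch of the approach, not the exact procedure used.]

```shell
# Enable IPv4 forwarding so the router node passes IPoIB traffic
# between the two subnets (persist in /etc/sysctl.conf as needed).
sysctl -w net.ipv4.ip_forward=1

# One HCA per subnet; clients on each side route the other subnet
# via this node's address on their own fabric.
ip addr show ib0   # 10.201.32.11/19 on the new (EDR) fabric
ip addr show ib1   # 10.2.1.22/16 on the old (QDR) fabric
```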
>>
>> Brief description:
>> Router node has 24 cores, 128GB RAM, and is running with the in-kernel
>> IB
>> drivers from Centos7.3. It connects to the new IB fabric via a Mellanox
>> EDR
>> card (MT4115) on ib0, and to the old via a Truescale QDR card (QLE7340) on
>> ib1. The old IB is on 10.2.0.0/16 (o2ib0), and the new is 10.201.32.0/19
>> (o2ib1).
>>
>> The new 2.10.0 server is on the EDR side, and the old 2.8.0 server is on
>> the QDR side. Nodes with QDR HCAs already coexist with EDR nodes on the
>> EDR
>> subnet without problems.
>>
>> All Lustre config via /etc/lnet.conf:
>> #####
>> net:
>> - net type: o2ib1
>> local NI(s):
>> - nid: 10.201.32.11 at o2ib1
>> interfaces:
>> 0: ib0
>> tunables:
>> peer_timeout: 180
>> peer_credits: 62
>> peer_buffer_credits: 512
>> credits: 1024
>> lnd tunables:
>> peercredits_hiw: 64
>> map_on_demand: 256
>> concurrent_sends: 62
>> fmr_pool_size: 2048
>> fmr_flush_trigger: 512
>> fmr_cache: 1
>> ntx: 2048
>> - net type: o2ib0
>> local NI(s):
>> - nid: 10.2.1.22 at o2ib0
>> interfaces:
>> 0: ib1
>> tunables:
>> peer_timeout: 180
>> peer_credits: 8
>> peer_buffer_credits: 512
>> credits: 1024
>> lnd tunables:
>> map_on_demand: 32
>> concurrent_sends: 16
>> fmr_pool_size: 2048
>> fmr_flush_trigger: 512
>> fmr_cache: 1
>> ntx: 2048
>> routing:
>> - small: 16384
>> large: 2048
>> enable: 1
>> ####
>>
>> While the setup works, I had to drop peer_credits to 8 on the QDR side
>> to avoid long periods of stalled traffic. It is probably going to be
>> adequate for the remaining month before total shutdown and removal of
>> the routers, but I would still like to have a better solution in hand.
>>
>> Questions:
>> 1) Is there a well-known good config for a qib<-->mlx5 LNET router?
>> 2) Where should I look to identify the cause of stalled traffic, which
>> still appears at higher load?
>> 3) What parameters should I be playing with to optimize the router?
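[Editorial note: regarding question 2, a sketch of a starting point for diagnosis with the 2.10 tooling. The lnetctl subcommands shown exist as of 2.10, and the /proc path is from older releases; confirm both against your build.]

```shell
# Snapshot LNET-wide message and drop counters; drops that grow
# under load suggest buffer or credit exhaustion on the router.
lnetctl stats show

# Per-network state and routing/buffer configuration on the router.
lnetctl net show -v
lnetctl routing show

# Per-peer credit usage (path may vary by version); peers stuck with
# negative available credits on o2ib0 would point at the QDR side.
cat /proc/sys/lnet/peers
```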
>>
>> Thanks,
>> Nate
>>
>>
>>
>> --
>>
>> Dr. Nathan Crawford nathan.crawford at uci.edu
>> Modeling Facility Director
>> Department of Chemistry
>> 1102 Natural Sciences II Office: 2101 Natural Sciences II
>> University of California, Irvine Phone: 949-824-4508
>> Irvine, CA 92697-2025, USA
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>>
>> ------------------------------
>>
>> End of lustre-discuss Digest, Vol 136, Issue 26
>> ***********************************************
>>
>
>
>
--
Dr. Nathan Crawford nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA