[lustre-discuss] lustre-discuss Digest, Vol 136, Issue 26

Mon Jul 31 11:27:52 PDT 2017

Hi Megan,

  I had actually seen those recommendations (and others in the Jira
discussions), and used them to get the non-routed mlx5/qib parameters
working. I think the periods of stalled/delayed routed traffic at high
network load might happen even if the IB hardware were homogeneous, just
not get as bad as quickly.

  Probably a better question: Are there 2.10-updated recommended procedures
for diagnosing IB LNET routing performance issues?

  My specific problems should go away in a few weeks when we unify the IB
fabric. However, a solution will help others in the future.

Thanks,
Nate

On Fri, Jul 28, 2017 at 10:49 AM, Ms. Megan Larko <dobsonunit at gmail.com>
wrote:

> Subject: LNET router (2.10.0) recommendations for  heterogeneous (mlx5,
> qib) IB setup
>
> Greetings!
>
> I did not see an answer to the question posed in the subject line above
> about heterogenous IB environments so I thought I would chime-in.
>
> One document I have found on the topic of heterogeneous IB environments is
> http://wiki.lustre.org/Infiniband_Configuration_Howto
>
> Generally speaking, networks like to be as homogeneous as possible.  That
> said, they may not always be such.  If you are working with mlx5, you may
> wish to look over LU-7124 and LU-1701 regarding the setting of
> peer_credits.  In Lustre versions prior to 2.9.0 the mlx5 did not handle
> peer_credits > 16 unless the map_on_demand was set to 256 (which is the
> default in newer versions of Lustre, I believe).
>
> Cheers,
> megan
>
> On Tue, Jul 25, 2017 at 4:11 PM, <lustre-discuss-request at lists.lustre.org>
> wrote:
>
>> Send lustre-discuss mailing list submissions to
>>         lustre-discuss at lists.lustre.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> or, via email, send a message with subject or body 'help' to
>>         lustre-discuss-request at lists.lustre.org
>>
>> You can reach the person managing the list at
>>         lustre-discuss-owner at lists.lustre.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of lustre-discuss digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Re: Install issues on 2.10.0 (John Casu)
>>    2. How does Lustre client side caching work? (Joakim Ziegler)
>>    3. LNET router (2.10.0) recommendations for  heterogeneous (mlx5,
>>       qib) IB setup (Nathan R.M. Crawford)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 25 Jul 2017 10:52:06 -0700
>> From: John Casu <john at chiraldynamics.com>
>> To: "Mannthey, Keith" <keith.mannthey at intel.com>, Ben Evans
>>         <bevans at cray.com>,      "lustre-discuss at lists.lustre.org"
>>         <lustre-discuss at lists.lustre.org>
>> Subject: Re: [lustre-discuss] Install issues on 2.10.0
>> Message-ID: <96d20a1a-9c15-167d-3538-50721f7872e8 at chiraldynamics.com>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Ok, so I assume this is actually a ZFS/SPL bug & not a lustre bug.
>> Also, thanks Ben, for the ptr.
>>
>> many thanks,
>> -john
>>
>> On 7/25/17 10:19 AM, Mannthey, Keith wrote:
>> > Host_id is for zpool double import protection.  If a host id is set on
>> a zpool (zfs does this automatically) then a HA server can't just import to
>> pool (users have to use --force). This makes the system a lot safer from
>> double zpool imports.  Call 'genhostid' on your Lustre servers and the
>> warning will go away.
>> >
>> > Thanks,
>> >   Keith
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org]
>> On Behalf Of Ben Evans
>> > Sent: Tuesday, July 25, 2017 10:13 AM
>> > To: John Casu <john at chiraldynamics.com>; lustre-discuss at lists.lustre.or
>> g
>> > Subject: Re: [lustre-discuss] Install issues on 2.10.0
>> >
>> > health_check moved to /sys/fs/lustre/ along with a bunch of other
>> things.
>> >
>> > -Ben
>> >
>> > On 7/25/17, 12:21 PM, "lustre-discuss on behalf of John Casu"
>> > <lustre-discuss-bounces at lists.lustre.org on behalf of
>> john at chiraldynamics.com> wrote:
>> >
>> >> Just installed latest 2.10.0 Lustre over ZFS on a vanilla Centos
>> >> 7.3.1611 system, using dkms.
>> >> ZFS is 0.6.5.11 from zfsonlinux.org, installed w. yum
>> >>
>> >> Not a single problem during installation, but I am having issues
>> >> building a lustre filesystem:
>> >> 1. Building a separate mgt doesn't seem to work properly, although the
>> >> mgt/mdt combo
>> >>     seems to work just fine.
>> >> 2. I get spl_hostid not set warnings, which I've never seen before 3.
>> >> /proc/fs/lustre/health_check seems to be missing.
>> >>
>> >> thanks,
>> >> -john c
>> >>
>> >>
>> >>
>> >> ---------
>> >> Building an mgt by itself doesn't seem to work properly:
>> >>
>> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs
>> >>> --force-nohostid --servicenode=192.168.98.113 at tcp \
>> >>>                                         --backfstype=zfs mgs/mgt
>> >>>
>> >>>     Permanent disk data:
>> >>> Target:     MGS
>> >>> Index:      unassigned
>> >>> Lustre FS:
>> >>> Mount type: zfs
>> >>> Flags:      0x1064
>> >>>                (MGS first_time update no_primnode ) Persistent mount
>> >>> opts:
>> >>> Parameters: failover.node=192.168.98.113 at tcp
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa mgs/mgt
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> Writing mgs/mgt properties
>> >>>    lustre:failover.node=192.168.98.113 at tcp
>> >>>    lustre:version=1
>> >>>    lustre:flags=4196
>> >>>    lustre:index=65535
>> >>>    lustre:svname=MGS
>> >>> [root at fb-lts-mds0 x86_64]# mount.lustre mgs/mgt /mnt/mgs
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>>
>> >>> mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'
>> >>
>> >> If I build the combo mgt/mdt, things go a lot better:
>> >>
>> >>>
>> >>> [root at fb-lts-mds0 x86_64]# mkfs.lustre --reformat --mgs --mdt
>> >>> --force-nohostid --servicenode=192.168.98.113 at tcp --backfstype=zfs
>> >>> --index=0 --fsname=test meta/meta
>> >>>
>> >>>     Permanent disk data:
>> >>> Target:     test:MDT0000
>> >>> Index:      0
>> >>> Lustre FS:  test
>> >>> Mount type: zfs
>> >>> Flags:      0x1065
>> >>>                (MDT MGS first_time update no_primnode )  Persistent
>> >>> mount opts:
>> >>> Parameters: failover.node=192.168.98.113 at tcp
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> mkfs_cmd = zfs create -o canmount=off -o xattr=sa meta/meta
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> Writing meta/meta properties
>> >>>    lustre:failover.node=192.168.98.113 at tcp
>> >>>    lustre:version=1
>> >>>    lustre:flags=4197
>> >>>    lustre:index=0
>> >>>    lustre:fsname=test
>> >>>    lustre:svname=test:MDT0000
>> >>> [root at fb-lts-mds0 x86_64]# mount.lustre meta/meta  /mnt/meta
>> >>> WARNING: spl_hostid not set. ZFS has no zpool import protection
>> >>> [root at fb-lts-mds0 x86_64]# df
>> >>> Filesystem          1K-blocks    Used Available Use% Mounted on
>> >>> /dev/mapper/cl-root  52403200 3107560  49295640   6% /
>> >>> devtmpfs             28709656       0  28709656   0% /dev
>> >>> tmpfs                28720660       0  28720660   0% /dev/shm
>> >>> tmpfs                28720660   17384  28703276   1% /run
>> >>> tmpfs                28720660       0  28720660   0% /sys/fs/cgroup
>> >>> /dev/sdb1             1038336  195484    842852  19% /boot
>> >>> /dev/mapper/cl-home  34418260   32944  34385316   1% /home
>> >>> tmpfs                 5744132       0   5744132   0% /run/user/0
>> >>> meta                 60435328     128  60435200   1% /meta
>> >>> meta/meta            59968128    4992  59961088   1% /mnt/meta
>> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/mdt/test-MDT0000/
>> >>> async_commit_count     hash_stats               identity_upcall
>> >>> num_exports         sync_count
>> >>> commit_on_sharing      hsm                      instance
>> >>> recovery_status     sync_lock_cancel
>> >>> enable_remote_dir      hsm_control              ir_factor
>> >>> recovery_time_hard  uuid
>> >>> enable_remote_dir_gid  identity_acquire_expire  job_cleanup_interval
>> >>> recovery_time_soft
>> >>> evict_client           identity_expire          job_stats
>> >>> rename_stats
>> >>> evict_tgt_nids         identity_flush           md_stats
>> >>> root_squash
>> >>> exports                identity_info            nosquash_nids
>> >>> site_stats
>> >>
>> >> Also, there's no /proc/fs/lustre/health_check
>> >>
>> >>> [root at fb-lts-mds0 ~]# ls /proc/fs/lustre/
>> >>> fld   llite  lod  lwp  mdd  mdt  mgs      osc      osp  seq
>> >>> ldlm  lmv    lov  mdc  mds  mgc  nodemap  osd-zfs  qmt  sptlrpc
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> lustre-discuss mailing list
>> >> lustre-discuss at lists.lustre.org
>> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> >
>> > _______________________________________________
>> > lustre-discuss mailing list
>> > lustre-discuss at lists.lustre.org
>> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> >
>> >
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Tue, 25 Jul 2017 13:09:52 -0500
>> From: Joakim Ziegler <joakim at terminalmx.com>
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] How does Lustre client side caching work?
>> Message-ID:
>>         <CABkNrDaecBOJSpUtYJ-Fz5+-8NBB4QqRWtCZEHSPP0y=1=befQ at mail.
>> gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello, I'm pretty new to Lustre, we're looking at setting up a Lustre
>> cluster for storage of media assets (something in the 0.5-1PB range to
>> start with, maybe 6 OSSes (in HA pairs), running on our existing FDR IB
>> network). It looks like a good match for our needs, however, there's an
>> area I've been unable to find details about. Note that I'm just
>> investigating for now, I have no running Lustre setup.
>>
>> There are plenty of references to Lustre using client side caching, and
>> how
>> the Distributed Lock Manager makes this work. However, I can't find almost
>> any information about how the client side cache actually works. When I
>> first heard it mentioned, I imagined something like the ZFS L2ARC, where
>> you can add a device (say, a couple of SSDs) to the client and point
>> Lustre
>> at it to use it for caching. But some references I come across just talk
>> about the normal kernel page cache, which is probably smaller and less
>> persistent than what I'd like for our usage.
>>
>> Could anyone enlighten me? I have a large dataset, but clients typically
>> use a small part of it at any given time, and uses it quite intensively,
>> so
>> a client-side cache (either a read cache or ideally a writeback cache)
>> would likely reduce network traffic and server load quite a bit. We've
>> been
>> using NFS over RDMA and fscache to get a read cache that does roughly this
>> so far on our existing file servers, and it's been quite effective, so I
>> imagine we could also benefit from something similar as we move to Lustre.
>>
>> --
>> Joakim Ziegler  -  Supervisor de postproducci?n  -  Terminal
>> joakim at terminalmx.com   -   044 55 2971 8514   -   5264 0864
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.
>> org/attachments/20170725/72452936/attachment.html>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Tue, 25 Jul 2017 11:52:43 -0700
>> From: "Nathan R.M. Crawford" <nrcrawfo at uci.edu>
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] LNET router (2.10.0) recommendations for
>>         heterogeneous (mlx5, qib) IB setup
>> Message-ID:
>>         <CAO6--ykDsL-VyqJK6=gPHdAY1psmx=mb-C25BeVD--cpztSWFQ at mail.
>> gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi All,
>>
>>   We are gradually updating a cluster (OS, etc.) in-place, basically
>> switching blocks of nodes from the old head node to the new. Until we can
>> re-arrange the fabric at the next scheduled machine room power shutdown
>> event, we are running two independent Infiniband subnets. As I can't find
>> useful documentation on proper IB routing between subnets, I have
>> configured one node with an HCA on each IB subnet that does simple IPoIB
>> routing and LNET routing.
>>
>> Brief description:
>>   Router node has 24 cores, 128GB RAM, and is running with the in-kernel
>> IB
>> drivers from Centos7.3. It connects to the new IB fabric via a Mellanox
>> EDR
>> card (MT4115) on ib0, and to the old via a Truescale QDR card (QLE7340) on
>> ib1. The old IB is on 10.2.0.0/16 (o2ib0), and the new is 10.201.32.0/19
>> (o2ib1 <http://10.201.32.0/19(o2ib1>).
>>
>>   The new 2.10.0 server is on the EDR side, and the old 2.8.0 server is on
>> the QDR side. Nodes with QDR HCAs already coexist with EDR nodes on the
>> EDR
>> subnet without problems.
>>
>> All Lustre config via /etc/lnet.conf:
>> #####
>> net:
>>     - net type: o2ib1
>>       local NI(s):
>>         - nid: 10.201.32.11 at o2ib1
>>           interfaces:
>>               0: ib0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 62
>>               peer_buffer_credits: 512
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 256
>>               concurrent_sends: 62
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>     - net type: o2ib0
>>       local NI(s):
>>         - nid: 10.2.1.22 at o2ib0
>>           interfaces:
>>               0: ib1
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 8
>>               peer_buffer_credits: 512
>>               credits: 1024
>>           lnd tunables:
>>               map_on_demand: 32
>>               concurrent_sends: 16
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>> routing:
>>     - small: 16384
>>       large: 2048
>>       enable: 1
>> ####
>>
>>   While the setup works, I had to drop peer_credits to 8 on the QDR side
>> to
>> avoid long periods of stalled traffic. It is probably going to be adequate
>> for the remaining month before total shutdown and removal of routers, but
>> I
>> would still like to have a better solution in hand.
>>
>> Questions:
>> 1) Is there a well-known good config for a qib<-->mlx5 LNET router?
>> 2) Where should I look to identify the cause of stalled traffic, which
>> still appears at higher load?
>> 3) What parameters should I be playing with to optimize the router?
>>
>> Thanks,
>> Nate
>>
>>
>>
>> --
>>
>> Dr. Nathan Crawford              nathan.crawford at uci.edu
>> Modeling Facility Director
>> Department of Chemistry
>> 1102 Natural Sciences II         Office: 2101 Natural Sciences II
>> University of California, Irvine  Phone: 949-824-4508
>> Irvine, CA 92697-2025, USA
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.
>> org/attachments/20170725/8727c755/attachment-0001.htm>
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>>
>> ------------------------------
>>
>> End of lustre-discuss Digest, Vol 136, Issue 26
>> ***********************************************
>>
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>

-- 

Dr. Nathan Crawford              nathan.crawford at uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II         Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20170731/12472ecb/attachment-0001.htm>