[Lustre-discuss] [PATCH] Avoid Lustre failure on temporary failure

Dilger, Andreas andreas.dilger at intel.com
Tue Sep 2 09:17:00 PDT 2014


On 2014/09/02, 7:05 AM, "Alexey Lyashkov" <alexey_lyashkov at xyratex.com>
wrote:

>we don’t need that many concurrent sends to a single peer, except for
>LNet routers.
>As for the other limits:
>Number of RPCs in flight == 1 for MDC<>MDT links,

Just a minor correction - while there is currently a limit of 1 modifying
RPC in flight for MDC-MDT, there may be up to 8 non-modifying RPCs in
flight (readdir, stat, statfs) to the MDT.  Also, there is work underway
to allow multiple modifying RPCs in flight to the MDT (LU-5319 proposed
for discussion at LAD).

>and no more than 32 for an OST, but we are limited to the 512 OST_IO
>threads.
>
>About credits: the number of credits used in the LNet calculation should
>depend on the number of buffers posted for incoming processing, and that
>number of buffers should in turn depend on performance results, such as
>the number of RPCs processed in some period of time.
>That would avoid over-buffering everywhere, but it opens a question
>about how credits are distributed across the cluster.

There was some research done a few years ago by Yingjin Qian on having
server-side control over RPCs in flight, instead of a static tunable at
the client.  This showed some improvement in performance, especially
during transitions when the workload changes.  However, that work was
never integrated into a Lustre release.

http://www.computer.org/csdl/proceedings/msst/2013/0217/00/06558432-abs.html
http://storageconference.us/2013/Presentations/Yimo.pdf

Cheers, Andreas

>On Sep 2, 2014, at 4:40 PM, Zhen, Liang <liang.zhen at intel.com> wrote:
>
>> Precisely, "credit" should be the number of concurrent sends (of
>> ko2iblnd messages) to a single peer; it is not the number of in-flight
>> Lustre RPCs.  I understand the memory issue with this: by enabling
>> map_on_demand, ko2iblnd will create an FMR for large-fragment bulk IO
>> (for example, 32+ fragments, or 128K+) and only allow small IOs to use
>> the current path, avoiding the overhead of creating an FMR.  Then we
>> have at most 32 fragments, and the QP size is only 1/8 of what it is
>> now.
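
As a back-of-the-envelope check of the 1/8 figure above, here is a
standalone C sketch; the peer_credits value of 8 is an assumed,
illustrative setting, not taken from any real ko2iblnd configuration:

    #include <stdio.h>

    /* Worst-case send work requests per connection, with and without
     * FMR (map_on_demand); all constants here are illustrative. */
    int main(void)
    {
            int page_size    = 4096;           /* 4K pages */
            int rpc_size     = 1024 * 1024;    /* 1M bulk RPC */
            int peer_credits = 8;              /* assumed concurrent sends */

            int frags_now = rpc_size / page_size;  /* 256 frags, no FMR */
            int frags_fmr = 32;                    /* map_on_demand cap */

            /* 8 x 256 = 2048 WRs today vs. 8 x 32 = 256 with FMR: 1/8 */
            printf("without FMR: %d WRs per connection\n",
                   peer_credits * frags_now);
            printf("with FMR:    %d WRs per connection\n",
                   peer_credits * frags_fmr);
            return 0;
    }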
>> 
>> Regards
>> Liang
>> 
>> On 9/2/14, 6:09 PM, "Alexey Lyashkov" <alexey_lyashkov at xyratex.com>
>> wrote:
>> 
>>> Credits for Lustre?  Does that work?  Right now it’s a strange
>>> number with no relation to the real network structure, and it
>>> produces over-buffering issues on the server side.
>>> 
>>> On Sep 2, 2014, at 12:22 PM, Zhen, Liang <liang.zhen at intel.com> wrote:
>>> 
>>>> Yes, I think this is the potential issue with this patch: for each
>>>> 1M of data, Lustre has 256 fragments (256 pages) on a 4K-pagesize
>>>> system, which means we can have up to (credits x 256) outstanding
>>>> work requests for each connection, so decreasing max_send_wr may hit
>>>> ib_post_send() failures under heavy workload.
>>>> 
>>>> I understand this may be a problem for the low-level stack, which
>>>> has to allocate a big chunk of space and can hit memory allocation
>>>> failures.  The solution is to enable map_on_demand and use FMR;
>>>> however, enabling this on some nodes will prevent them from joining
>>>> the cluster if other nodes do not have map_on_demand.  We already
>>>> have a patch for this pending review; please check LU-3322.
>>>> 
>>>> Thanks
>>>> Liang
>>>> 
>>>> From: David McMillen <mcmillen at cray.com>
>>>> Date: Sunday, August 31, 2014 at 6:48 PM
>>>> To: "lustre-discuss at lists.lustre.org"
>>>> <lustre-discuss at lists.lustre.org>, Eli Cohen <eli at dev.mellanox.co.il>
>>>> Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on
>>>> temporary failure
>>>> 
>>>> Has this been tested with a significant I/O load?  We had tried a
>>>> similar approach but ran into subsequent errors and connection drops
>>>> when the ib_post_send() failed.  The code assumes that the original
>>>> init_qp_attr->cap.max_send_wr value succeeded.  Is there a second part
>>>> to this patch?
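
To make that failure mode concrete, here is a minimal user-space
simulation; post_send(), REQUESTED_WR, and CREATED_WR are invented
stand-ins for illustration, not names from the o2iblnd code:

    #include <stdio.h>

    #define REQUESTED_WR 64   /* depth the upper layer still assumes */
    #define CREATED_WR   32   /* depth the halving loop actually got */

    static int posted;        /* WRs currently outstanding on the QP */

    /* Stand-in for ib_post_send(): fails once the real queue is full. */
    static int post_send(int queue_depth)
    {
            if (posted >= queue_depth)
                    return -1;
            posted++;
            return 0;
    }

    int main(void)
    {
            /* The sender paces itself by the depth it asked for, not
             * the depth it actually got, and eventually fails. */
            for (int i = 0; i < REQUESTED_WR; i++) {
                    if (post_send(CREATED_WR) != 0) {
                            printf("post %d failed: queue depth is %d, not %d\n",
                                   i, CREATED_WR, REQUESTED_WR);
                            return 1;
                    }
            }
            return 0;
    }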
>>>> 
>>>> Dave
>>>> 
>>>> On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen
>>>> <eli at dev.mellanox.co.il> wrote:
>>>> 
>>>>> Lustre code tries to create a QP with a max_send_wr that depends on
>>>>> a module parameter.  The device capabilities do provide the maximum
>>>>> number of send work requests that the device supports, but the
>>>>> actual number of work requests that can be supported in a specific
>>>>> case depends on other characteristics of the work queue, the
>>>>> transport type, etc.  This is in compliance with the IB spec:
>>>>> 
>>>>> 11.2.1.2 QUERY HCA
>>>>> Description:
>>>>> Returns the attributes for the specified HCA.
>>>>> The maximum values defined in this section are guaranteed
>>>>> not-to-exceed values.  It is possible for an implementation to
>>>>> allocate some HCA resources from the same space.  In that case, the
>>>>> maximum values returned are not guaranteed for all of those
>>>>> resources simultaneously.
>>>>> 
>>>>> This patch tries to decrease the number of requested work requests
>>>>> to a level that can be supported by the HCA.  This prevents
>>>>> unnecessary failures.
>>>>> 
>>>>> Signed-off-by: Eli Cohen <eli at mellanox.com>
>>>>> ---
>>>>> lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>>>>> 1 file changed, 18 insertions(+), 7 deletions(-)
>>>>> 
>>>>> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> index 4061db00cba2..ef1c6e07cb45 100644
>>>>> --- a/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
>>>>> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>>     int                     cpt;
>>>>>     int                     rc;
>>>>>     int                     i;
>>>>> +     int                     orig_wr;
>>>>> 
>>>>>     LASSERT(net != NULL);
>>>>>     LASSERT(!in_interrupt());
>>>>> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>> 
>>>>>     conn->ibc_sched = sched;
>>>>> 
>>>>> -        rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>>> -        if (rc != 0) {
>>>>> -                CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>>> -                       rc, init_qp_attr->cap.max_send_wr,
>>>>> -                       init_qp_attr->cap.max_recv_wr);
>>>>> -                goto failed_2;
>>>>> -        }
>>>>> +     orig_wr = init_qp_attr->cap.max_send_wr;
>>>>> +     do {
>>>>> +             rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>>> +             if (!rc || init_qp_attr->cap.max_send_wr < 16)
>>>>> +                     break;
>>>>> +
>>>>> +             init_qp_attr->cap.max_send_wr /= 2;
>>>>> +     } while (rc);
>>>>> +     if (rc != 0) {
>>>>> +             CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>>> +                    rc, init_qp_attr->cap.max_send_wr,
>>>>> +                    init_qp_attr->cap.max_recv_wr);
>>>>> +             goto failed_2;
>>>>> +     }
>>>>> +     if (orig_wr != init_qp_attr->cap.max_send_wr)
>>>>> +             pr_info("original send wr %d, created with %d\n",
>>>>> +                     orig_wr, init_qp_attr->cap.max_send_wr);
>>>>> 
>>>>>        LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>>>>> 
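
As an aside, the same not-to-exceed behavior, and the same halve-and-retry
strategy as the patch above, can be reproduced from user space.  Below is
a minimal libibverbs sketch; error handling is mostly omitted, and it
needs an RDMA-capable node to run:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
            struct ibv_device **devs = ibv_get_device_list(NULL);
            if (devs == NULL || devs[0] == NULL) {
                    fprintf(stderr, "no RDMA device found\n");
                    return 1;
            }

            struct ibv_context *ctx = ibv_open_device(devs[0]);
            struct ibv_device_attr attr;
            ibv_query_device(ctx, &attr);  /* max_qp_wr is not-to-exceed */

            struct ibv_pd *pd = ibv_alloc_pd(ctx);
            struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

            struct ibv_qp_init_attr init = {
                    .send_cq = cq,
                    .recv_cq = cq,
                    .qp_type = IBV_QPT_RC,
                    .cap     = {
                            .max_send_wr  = attr.max_qp_wr, /* ask for max */
                            .max_recv_wr  = 256,
                            .max_send_sge = 1,
                            .max_recv_sge = 1,
                    },
            };

            /* Halve max_send_wr until creation succeeds, as the patch
             * does with rdma_create_qp() in the kernel. */
            struct ibv_qp *qp;
            while ((qp = ibv_create_qp(pd, &init)) == NULL &&
                   init.cap.max_send_wr >= 16)
                    init.cap.max_send_wr /= 2;

            printf("device max_qp_wr %d, QP created with max_send_wr %u\n",
                   attr.max_qp_wr, qp != NULL ? init.cap.max_send_wr : 0);
            return qp == NULL ? 1 : 0;
    }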
>>>> 


-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division



