[Lustre-discuss] [PATCH] Avoid Lustre failure on temporary failure
Zhen, Liang
liang.zhen at intel.com
Tue Sep 2 05:40:31 PDT 2014
To be precise, "credits" is the number of concurrent sends (of ko2iblnd
messages) to a single peer, not the number of in-flight Lustre RPCs. I
understand the memory issue here. With map_on_demand enabled, ko2iblnd
creates an FMR for bulk IO with many fragments (for example, 32+ fragments
or 128K+) and lets only small IOs use the current path, which avoids the
overhead of creating an FMR for them. That caps us at 32 fragments, so the
QP size is only 1/8 of what it is now.
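To make the 1/8 figure concrete, here is a rough sketch of the arithmetic
using the simplified "credits x fragments" bound from this thread. The
function names are made up for illustration and the credit count of 8 is
only an assumed example value, not a ko2iblnd default.

```c
#include <stdio.h>

/* Number of page-sized fragments needed for one bulk transfer. */
unsigned int frags_per_bulk(unsigned int bytes, unsigned int page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/*
 * Simplified worst case discussed in the thread: every credit
 * (concurrent send) may carry a full set of fragments.
 * This is an illustration, not the exact ko2iblnd formula.
 */
unsigned int send_wrs(unsigned int credits, unsigned int frags)
{
    return credits * frags;
}
```

With 8 credits (assumed), send_wrs(8, frags_per_bulk(1 << 20, 4096)) gives
8 x 256 = 2048 work requests, while send_wrs(8, 32) gives 256, which is the
1/8 reduction mentioned above.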
Regards
Liang
On 9/2/14, 6:09 PM, "Alexey Lyashkov" <alexey_lyashkov at xyratex.com> wrote:
>Credits for Lustre? Does that work? Right now it is a strange number with
>no relation to the real network structure, and it produces over-buffering
>issues on the server side.
>
>On Sep 2, 2014, at 12:22 PM, Zhen, Liang <liang.zhen at intel.com> wrote:
>
>> Yes, I think this is a potential issue with this patch. For each 1M of
>>data Lustre has 256 fragments (256 pages) on a system with 4K pages, which
>>means we can have up to (credits x 256) outstanding work requests for
>>each connection. Decreasing max_send_wr may therefore cause
>>ib_post_send() failures under heavy workload.
>>
>> I understand it may be a problem for the low-level stack to allocate a
>>big chunk of space, which can cause memory allocation failures. The
>>solution is to enable map_on_demand and use FMR. However, enabling this
>>on some nodes prevents them from joining the cluster if other nodes do
>>not have map_on_demand enabled. We already have a patch for this that is
>>pending review; please see LU-3322.
>>
>> Thanks
>> Liang
>>
>> From: David McMillen <mcmillen at cray.com>
>> Date: Sunday, August 31, 2014 at 6:48 PM
>> To: "lustre-discuss at lists.lustre.org"
>><lustre-discuss at lists.lustre.org>, Eli Cohen <eli at dev.mellanox.co.il>
>> Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on temporary
>>failure
>>
>> Has this been tested with a significant I/O load? We had tried a
>>similar approach but ran into subsequent errors and connection drops
>>when the ib_post_send() failed. The code assumes that the original
>>init_qp_attr->cap.max_send_wr value succeeded. Is there a second part
>>to this patch?
>>
>> Dave
>>
>> On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen
>><eli at dev.mellanox.co.il> wrote:
>>
>>> Lustre code tries to create a QP with max_send_wr which depends on a
>>> module parameter. The device capabilities do provide the maximum number
>>> of send work requests that the device supports but the actual number of
>>> work requests that can be supported in a specific case depends on other
>>> characteristics of the work queue, the transport type, etc. This is in
>>> compliance with the IB spec:
>>>
>>> 11.2.1.2 QUERY HCA
>>> Description:
>>> Returns the attributes for the specified HCA.
>>> The maximum values defined in this section are guaranteed
>>> not-to-exceed values. It is possible for an implementation to allocate
>>> some HCA resources from the same space. In that case, the maximum
>>> values returned are not guaranteed for all of those resources
>>> simultaneously.
>>>
>>> This patch tries to decrease the number of requested work requests to
>>> a level that can be supported by the HCA. This prevents unnecessary
>>> failures.
>>>
>>> Signed-off-by: Eli Cohen <eli at mellanox.com>
>>> ---
>>> lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>>> 1 file changed, 18 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
>>> index 4061db00cba2..ef1c6e07cb45 100644
>>> --- a/lnet/klnds/o2iblnd/o2iblnd.c
>>> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
>>> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>> int cpt;
>>> int rc;
>>> int i;
>>> + int orig_wr;
>>>
>>> LASSERT(net != NULL);
>>> LASSERT(!in_interrupt());
>>> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>
>>> conn->ibc_sched = sched;
>>>
>>> - rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>> - if (rc != 0) {
>>> - CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>> - rc, init_qp_attr->cap.max_send_wr,
>>> - init_qp_attr->cap.max_recv_wr);
>>> - goto failed_2;
>>> - }
>>> + orig_wr = init_qp_attr->cap.max_send_wr;
>>> + do {
>>> + rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>> + if (!rc || init_qp_attr->cap.max_send_wr < 16)
>>> + break;
>>> +
>>> + init_qp_attr->cap.max_send_wr /= 2;
>>> + } while (rc);
>>> + if (rc != 0) {
>>> + CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>> + rc, init_qp_attr->cap.max_send_wr,
>>> + init_qp_attr->cap.max_recv_wr);
>>> + goto failed_2;
>>> + }
>>> + if (orig_wr != init_qp_attr->cap.max_send_wr)
>>> + pr_info("original send wr %d, created with %d\n",
>>> + orig_wr, init_qp_attr->cap.max_send_wr);
>>>
>>> LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>>>
>>
>
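The retry idea in the quoted patch, together with Dave's concern that the
rest of the code still assumes the original max_send_wr, can be sketched
outside the kernel roughly as follows. This is only an illustration under
stated assumptions: fake_create_qp() is a hypothetical stand-in for
rdma_create_qp(), the helper names are invented, and unlike the patch this
variant never retries with a value below the floor.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for rdma_create_qp(): succeeds only when the
 * requested send queue depth fits what the device can provide right now.
 */
int fake_create_qp(unsigned int max_send_wr, unsigned int device_limit)
{
    return max_send_wr <= device_limit ? 0 : -ENOMEM;
}

/*
 * Halve the requested depth until creation succeeds or the next attempt
 * would drop below `floor`. On success returns 0 and stores the depth
 * that was actually granted in *granted; otherwise returns the last error.
 */
int create_qp_with_backoff(unsigned int wanted, unsigned int floor,
                           unsigned int device_limit, unsigned int *granted)
{
    unsigned int wr = wanted;
    int rc;

    for (;;) {
        rc = fake_create_qp(wr, device_limit);
        if (rc == 0) {
            *granted = wr;
            return 0;
        }
        if (wr / 2 < floor)
            return rc;
        wr /= 2;
    }
}

/*
 * The sender has to check against the granted depth, not the value it
 * originally asked for; otherwise posting can overrun the send queue.
 */
int can_post_send(unsigned int outstanding, unsigned int nwrs,
                  unsigned int granted)
{
    return outstanding + nwrs <= granted;
}
```

For example, asking for 2048 with a floor of 16 against a device that can
only provide 600 ends up with a depth of 512. Any code that still budgets
for 2048 outstanding work requests would then overrun the queue, which is
the kind of ib_post_send() failure discussed earlier in the thread.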