[Lustre-devel] hiding non-fatal communications errors

Oleg Drokin Oleg.Drokin at Sun.COM
Thu Jun 5 20:38:02 PDT 2008


Hello!

    Because there is no way to deliver them. We send our first
acknowledgement of AST reception, and it is delivered quickly; that
acknowledgement is the reply.
    Now what is left is to send the actual dirty data and then the
cancel request. These are not replies but stand-alone client-generated
RPCs, and we cannot cancel locks while the dirty data is not flushed.
Just inventing some sort of LDLM "I am still alive" RPC to send
periodically instead of cancels is dangerous: the data-sending part
could be wedged for unrelated reasons, not only because of contention
but due to some client problem, and if we prolong locks by other
means, that can potentially wedge all access to that part of the file
forever. And the dirty data itself takes too long to get to actual
server processing.
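
    To make the ordering concrete, here is a minimal, self-contained
C sketch of the sequence just described. All names are made up for
illustration; these are not the actual Lustre entry points.

/* Illustrative model of the blocking-AST sequence described above.
 * All names are hypothetical; this is not the Lustre code itself. */
#include <stdio.h>

/* Step 1: the only "reply" the server ever sees -- the immediate
 * acknowledgement that the blocking AST was received. */
static void ack_blocking_ast(void)
{
        printf("client -> server: AST received (fast; this is the reply)\n");
}

/* Step 2: flushing dirty data. These are stand-alone client-generated
 * RPCs, not replies, and they queue behind everyone else's I/O. */
static void flush_dirty_data(void)
{
        printf("client -> server: bulk writes of dirty pages (slow)\n");
}

/* Step 3: only after the flush completes may the lock be cancelled. */
static void send_cancel(void)
{
        printf("client -> server: LDLM cancel RPC\n");
}

int main(void)
{
        ack_blocking_ast();
        flush_dirty_data();     /* if this wedges, no cancel ever arrives */
        send_cancel();
        return 0;
}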
    One of the solutions here is the request scheduler, or some
stand-alone part of it that could peek early into RPCs as they arrive,
so that when the decision is being made about client eviction, we can
quickly see what is in the queue from that client and perhaps, based
on that data, postpone the eviction. This was discussed on the ORNL
call.
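
    A rough, self-contained C sketch of that peek-before-evicting
idea follows; the structures and names are hypothetical, not the real
request scheduler.

/* Hypothetical sketch: before evicting a client, scan the incoming
 * queue for requests it already has in flight and, if any are found,
 * postpone the eviction. A plain C model, not Lustre code. */
#include <stdbool.h>
#include <stdio.h>

struct request {
        int client_id;              /* which export/client sent this RPC */
        struct request *next;
};

/* True if the queue already holds work from this client, i.e.
 * evidence that it is alive and merely stuck behind the backlog. */
static bool client_has_queued_work(const struct request *queue, int client_id)
{
        for (const struct request *r = queue; r != NULL; r = r->next)
                if (r->client_id == client_id)
                        return true;
        return false;
}

static void maybe_evict(const struct request *queue, int client_id)
{
        if (client_has_queued_work(queue, client_id))
                printf("client %d: postpone eviction, its data is queued\n",
                       client_id);
        else
                printf("client %d: evict, nothing pending from it\n",
                       client_id);
}

int main(void)
{
        struct request r2 = { .client_id = 7, .next = NULL };
        struct request r1 = { .client_id = 3, .next = &r2 };

        maybe_evict(&r1, 7);        /* queued write found -> postpone */
        maybe_evict(&r1, 9);        /* nothing queued -> evict */
        return 0;
}

The point is only that the eviction decision can consult requests that
have already arrived but are not yet processed, which is the same
information AT extracts when it peeks at the incoming queue.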
    Andreas said that AT already looks into incoming RPCs before
processing them, to get an idea of expected service times; perhaps it
would not be too hard to add some logic that would link requests to
the exports they came from, for further analysis, if the need for it
arises.
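
    Along the same lines, a hedged sketch of what linking requests to
their exports could look like, assuming a simple per-export list
rather than the real struct obd_export:

/* Illustrative only: record each incoming request on the export it
 * came from at peek time, so per-client analysis is a cheap list walk
 * instead of a scan of the whole service queue. */
#include <stdio.h>

struct req {
        int id;
        struct req *exp_next;       /* link within the owning export */
};

struct export {
        int client_id;
        struct req *reqs;           /* requests seen at peek time */
};

/* Called when peeking classifies an incoming RPC: attach it to its
 * export before it sits in the global queue. */
static void link_req_to_export(struct export *exp, struct req *r)
{
        r->exp_next = exp->reqs;
        exp->reqs = r;
}

int main(void)
{
        struct export exp = { .client_id = 3, .reqs = NULL };
        struct req a = { .id = 1, .exp_next = NULL };
        struct req b = { .id = 2, .exp_next = NULL };

        link_req_to_export(&exp, &a);
        link_req_to_export(&exp, &b);

        for (struct req *r = exp.reqs; r != NULL; r = r->exp_next)
                printf("export %d has queued request %d\n",
                       exp.client_id, r->id);
        return 0;
}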

Bye,
     Oleg
On Jun 5, 2008, at 11:29 PM, Peter Braam wrote:

> Why can we not send early replies?
>
>
> On 6/5/08 9:59 AM, "Oleg Drokin" <Oleg.Drokin at Sun.COM> wrote:
>
>> Hello!
>>
>> On Jun 5, 2008, at 12:42 PM, Robert Read wrote:
>>
>>>>> I suspect this could be adapted to allowing a fixed number of
>>>>> retries for
>>>>> server-originated RPCs also.  In the case of LDLM blocking  
>>>>> callbacks
>>>>> sent
>>>>> to a client, a resend is currently harmless (either the client is
>>>>> already
>>>>> processing the callback, or the lock was cancelled).
>>>> We need to be careful here and decide on a good strategy for
>>>> when to resend.
>>>> E.g. a recent case at ORNL (even if a bit pathological) is that
>>>> they pound through thousands of clients to 4 OSSes via 2 routers.
>>>> That creates request waiting lists on the OSSes well into the
>>>> tens of thousands. When we block on a lock and send a blocking
>>>> AST to the client, it quickly turns around and puts its data...
>>>> at the end of our list, which takes hundreds of seconds (more
>>>> than obd_timeout, obviously). No matter how much you resend, it
>>>> won't help.
>>> This looks like the poster child for adaptive timeouts, although
>>> we might need some version of the early margin update patch on
>>> 15501.  Have you tried enabling AT?
>>
>> The problem is that AT does not handle this specific case: there
>> is no way to deliver an "early reply" from a client to the server
>> saying "I am working on it" other than just sending the dirty data.
>> But dirty data gets into a queue for way too long.
>> There are no timed-out requests; the only thing timing out is a
>> lock that is not cancelled in time.
>> AT was not tried - this is hard to do at ORNL, as the client side
>> is a Cray XT4 machine, and updating clients is hard. So they are on
>> 1.4.11 of some sort.
>> They can easily update servers, but this won't help, of course.
>>
>>> Maybe that was done to discourage people from disabling AT?
>>> Seriously, though, I don't know why that was changed. Perhaps it
>>> was done on b1_6 before AT landed?
>>
>> hm, indeed. I see this change in 1.6.3.
>>
>> Bye,
>>     Oleg
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
>



