[Lustre-discuss] write RPC & congestion
Oleg Drokin
green at whamcloud.com
Wed Dec 22 13:47:06 PST 2010
Hello!
On Dec 22, 2010, at 8:25 AM, Jeremy Filizetti wrote:
>> Well, I don't think this should matter anyhow. Since we send the RPCs in an async manner in parallel anyway, the latency of the bulk descriptor GET does not add up.
> It does make a difference before readahead has kicked in. There is a big difference in how things start off, even though for a sequential load they all peak at the same value. For instance, if I was just accessing a portion of a file, or seeking around and reading a few MBs instead of reading it all sequentially, this has a big impact IIRC.
Right, I can see where large RPCs would help in the case of reads while readahead picks up the pace.
But the same thing should not happen with writes, since the client is the sole originator of the writes.
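To make that concrete, here is a minimal back-of-envelope model (the numbers are made up: 100 ms RTT, 100 MB/s link, 64 MB read before readahead ramps up), assuming a synchronous read pays ~2 RTTs per RPC, serially, until readahead catches up:

    /* Sketch of the latency model above, not actual Lustre code.
     * A synchronous read that readahead hasn't caught up with pays
     * ~2 RTTs per RPC (request/reply plus bulk transfer), serially;
     * larger RPCs amortize that cost. Pipelined writes avoid it. */
    #include <stdio.h>

    int main(void)
    {
            double rtt_s = 0.100;     /* assumed 100 ms round trip */
            double bw_MBps = 100.0;   /* assumed link bandwidth */
            double total_MB = 64.0;   /* read before readahead ramps up */
            double rpc_MB[] = { 1.0, 4.0 };

            for (int i = 0; i < 2; i++) {
                    double nrpc = total_MB / rpc_MB[i];
                    double t = nrpc * 2.0 * rtt_s + total_MB / bw_MBps;
                    printf("%.0f MB RPCs: %.0f RPCs, ~%.1f s serial\n",
                           rpc_MB[i], nrpc, t);
            }
            return 0;
    }

With those assumptions, 1 MB RPCs take ~13.4 s and 4 MB RPCs take ~3.8 s for the same 64 MB, which is why RPC size shows up so strongly before readahead kicks in.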
>> How does your RTT delay get introduced for the test? Could it be that when there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay, like bottlenecking a single message at a time with a mandatory delay or something like this)?
> The RTT delay is handled by the Obsidian Longbow and was set symmetric to half the RTT on each end (I think they delay receive traffic only, not transmit). The data at 110+ ms seems to have a larger decay than the rest of the data, so I'm not sure if something was happening before that, but the rest of the data seems consistent with what I've seen using real distance rather than simulated delay, although I only have a few points to compare it to and not the whole range of 0-200 ms.
I think it would also be interesting to run a similar test over a genuinely high-latency link.
>> Probably the difference is that of GET vs PUT semantics in LNet; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation
>> here that involves bulk IO; it has this many pages and the descriptor is so and so", then the server does another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess?)
> I need to look at this more, but it seems to me that the read case should still be capable of completing in 1 RTT, because the server can send the response as soon as it gets the request, since all the MD info should be included with the request?
Well, while theoretically that might be the case, with Lustre as it is right now a bulk RPC is a two-phase process: one RTT transmits "metadata" of sorts that describes the IO in one direction and returns the IO status back, and
the other RTT actually transfers the data over the wire in one of the directions.
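Roughly, for a bulk write the sequence looks like this (an idealized sketch of my understanding, not a trace; each line shows when the message is sent, and a bulk read is the mirror image with an LNet PUT pushing data to the client):

    t = 0.0 RTT  client -> server : OST_WRITE request (page count, bulk descriptor)
    t = 0.5 RTT  server -> client : LNet GET for the pages named in the descriptor
    t = 1.0 RTT  client -> server : bulk data in reply to the GET (plus size/bandwidth)
    t = 1.5 RTT  server -> client : OST_WRITE reply carrying the IO status
    t = 2.0 RTT  client sees completion

So the client sees about 2 RTTs end to end before the bandwidth term even enters the picture.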
>> It would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>> Did you try direct IO too? Some older versions of Lustre used to send all outstanding direct IO RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
> I didn't try any direct IO but I certainly could.
Thanks.
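If you do, even a plain dd with oflag=direct would exercise that path; the mount point and size below are just placeholders:

    dd if=/dev/zero of=/mnt/lustre/dio_test bs=64M count=1 oflag=direct

If we still send all the RPCs a single direct IO write generates in parallel, that one write should complete in roughly a couple of RTTs plus the transfer time.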
Please keep us informed if you see anything else interesting.
Bye,
Oleg