[Lustre-discuss] write RPC & congestion

Oleg Drokin green at whamcloud.com
Tue Dec 21 22:32:23 PST 2010


Hello!

On Dec 22, 2010, at 12:43 AM, Jeremy Filizetti wrote:

> In the attachment I created that Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423, if you look at graphs 1 and 2, they are both using larger than the default max_rpcs_in_flight.  I believe the data without the patch from bug 16900 had max_rpcs_in_flight=42, and the data with the patch from 16900 used max_rpcs_in_flight=32.  So the short answer is we are already increasing max_rpcs_in_flight for all of that data (which is needed for good performance at higher latencies).

Ah! This should have been noted somewhere.
Well, it's still unfair then! ;)
You see, each OSC can cache up to 32 MB of dirty data by default (the max_dirty_mb OSC setting in /proc).
So when you have 4 MB RPCs, you actually use only 8 RPCs to transfer your entire allotment of dirty pages, whereas you use 32 for 1 MB RPCs (so setting max_rpcs_in_flight any higher has no effect unless you also bump max_dirty_mb). Of course this only affects the write RPCs, not reads.
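To put rough numbers on it, here is a back-of-the-envelope sketch (plain arithmetic, not Lustre code; the parameter names just mirror the /proc tunables):

# Sketch only: how max_dirty_mb caps the number of write RPCs an OSC can
# usefully keep in flight, independent of max_rpcs_in_flight.
def effective_write_rpcs(max_dirty_mb, rpc_size_mb, max_rpcs_in_flight):
    dirty_limited = max_dirty_mb // rpc_size_mb   # full RPCs' worth of dirty pages
    return min(dirty_limited, max_rpcs_in_flight)

print(effective_write_rpcs(32, 1, 32))    # 32 -> 1 MB RPCs can use the whole window
print(effective_write_rpcs(32, 4, 32))    # 8  -> 4 MB RPCs are capped by dirty data
print(effective_write_rpcs(128, 4, 32))   # 32 -> bumping max_dirty_mb restores it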

> My understanding of the real benefit from the larger RPC patch is that we are not having to face 12 round-trip times to read 4 MB (4 x 1 MB bulk RPCs); instead I think we have 3.  Although I've never traced through to see that this is actually what is happening, from what I read about the patch it sends 4 memory descriptors with a single bulk request.

Well, I don't think this should matter anyhow. Since we send the RPCs asynchronously and in parallel anyway, the latency of the bulk descriptor GET does not add up.
Because of that, the results you got should have been much closer together too. I wonder what other factors played a role here?
I see you only had a single client, so it's not as if you were able to overwhelm the number of OSS threads running. Even in the case of 6 OSTs per OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did you check that that's the number of threads you had running, by any chance?
How does your RTT delay get introduced for the test? Could it be that when there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay, like bottlenecking on a single message at a time with a mandatory delay, or something like that)?
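To make the "latency does not add up" point concrete, here is a toy timing model (made-up numbers and a deliberately crude sketch, not the actual client code path):

import math

# Toy model: each bulk operation costs one RTT for the request RPC that
# carries the bulk descriptor plus one RTT for the server-driven data
# transfer.  RPCs in the same window overlap, so latency grows with the
# number of "waves" of RPCs, not with the number of RPCs themselves.
def transfer_time(total_mb, rpc_mb, rpcs_in_flight, rtt_s, bw_mb_s):
    rpcs = math.ceil(total_mb / rpc_mb)
    waves = math.ceil(rpcs / rpcs_in_flight)
    return waves * 2 * rtt_s + total_mb / bw_mb_s

# 1 GB of writes over a 50 ms RTT, 100 MB/s link (numbers made up):
print(transfer_time(1024, 1, 32, 0.05, 100))  # 1 MB RPCs, 32 in flight
print(transfer_time(1024, 4, 8, 0.05, 100))   # 4 MB RPCs, only 8 in flight

Under this crude model the two configurations come out identical, which is why I would have expected your curves to sit much closer together.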

> What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for a write.  I think I understand the write having to communicate once with the server, because preallocating buffers for all clients would possibly be a waste of resources.  But for reading it seems logical (from the RDMA standpoint) that the memory buffer could be pre-registered and sent to the server, and the server would respond back with the contents for that buffer, which would be 1 RTT.

Probably the difference is that of GET vs PUT semantics in LNET; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation here that involves bulk I/O, it has this many pages and the descriptor is so and so"; then the server spends another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess?).
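As a sketch of that breakdown for a single bulk RPC (assumed numbers, and ignoring whichever of GET/PUT is the more expensive case):

# Sketch of the 2-RTT breakdown for one bulk RPC (assumed numbers; the
# GET vs PUT asymmetry is ignored here):
def single_bulk_latency(rtt_s, bulk_mb, bw_mb_s):
    request_rtt = rtt_s            # "header" RPC describing the bulk descriptor
    bulk_rtt = rtt_s               # server-initiated GET (read) or PUT (write)
    wire_time = bulk_mb / bw_mb_s  # time for the data itself on the wire
    return request_rtt + bulk_rtt + wire_time

print(single_bulk_latency(0.05, 1, 100))  # 1 MB bulk over 50 ms RTT -> ~0.11 s
print(single_bulk_latency(0.05, 4, 100))  # 4 MB bulk                -> ~0.14 s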

> I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.

It would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
Did you try direct I/O too? Some older versions of Lustre used to send all outstanding direct I/O RPCs in parallel, so if you did your I/O as just a single direct I/O write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
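Roughly, under the same kind of toy model (again just a sketch with made-up numbers):

# Under the same toy model: if all the direct I/O RPCs for one write go out
# in parallel, the write's latency is roughly one "wave", i.e. a couple of
# RTTs plus the wire time for the whole buffer.
def directio_write_latency(total_mb, rtt_s, bw_mb_s):
    return 2 * rtt_s + total_mb / bw_mb_s

print(directio_write_latency(64, 0.05, 100))  # 64 MB O_DIRECT write -> ~0.74 s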

Bye,
    Oleg

