<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<meta name="Generator" content="Microsoft Exchange Server">

<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>

</head>

<body>


<style type="text/css" style="">

<!--

p

        {margin-top:0;

        margin-bottom:0}

-->

</style>

<div dir="ltr">

<div id="x_divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:#000000; font-family:Calibri,Helvetica,sans-serif">

<p>Anna,</p>

<br>

<p>I would be happy to help with review, etc., on this once it's ready to be posted.</p>

<p><br>

</p>

<p>In the meantime, <span>I am curious how you handled compression given the problem of discontiguous sets of pages. Did you use scatter-gather lists like the encryption code does, or some other solution?</span><br>

</p>

<p><br>

</p>

<p>Are you willing/able to share the current code, perhaps even off list?  I certainly understand if not, but I am curious to see how it will work and explore the performance implications.</p>

<p><br>

</p>

<p>- Patrick<br>

</p>

</div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Anna Fuchs &lt;anna.fuchs@informatik.uni-hamburg.de&gt;<br>

<b>Sent:</b> Thursday, July 27, 2017 3:26:00 AM<br>

<b>To:</b> Patrick Farrell; Xiong, Jinshan<br>

<b>Cc:</b> Matthew Ahrens; Zhuravlev, Alexey; lustre-devel<br>

<b>Subject:</b> Re: [lustre-devel] Design proposal for client-side compression</font>

<div> </div>

</div>

</div>

<font size="2"><span style="font-size:10pt;">

<div class="PlainText">Patrick, <br>

<br>

> Having reread your LAD presentation (I was there, but it's been a<br>

> while...), I think you've got a good architecture.<br>

<br>

There have been some changes since then, but the overall design should<br>

be the same.<br>

<br>

> A few thoughts.<br>

> <br>

> 1. Jinshan was just suggesting including in the code a switch to<br>

> enable/disable the feature at runtime. For an example, see his fast<br>

> read patch:<br>

> <a href="https://review.whamcloud.com/#/c/20255/">https://review.whamcloud.com/#/c/20255/</a><br>

> Especially the proc section:<br>

> <a href="https://review.whamcloud.com/#/c/20255/7/lustre/llite/lproc_llite.c">https://review.whamcloud.com/#/c/20255/7/lustre/llite/lproc_llite.c</a><br>

> The effect of that is a file in proc that one can use to<br>

> disable/enable the feature by echoing 0 or 1.<br>

> (I think there is probably a place for tuning beyond that, but that's<br>

> separate.)<br>

> This is great for features that may have complex impacts, and also<br>

> for people who want to test a feature to see how it changes things.<br>

<br>

Oh, I misunderstood Jinshan last time, sorry. Yes, it would be much<br>

easier for users and should be possible. Thank you for the references!<br>

<br>
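For illustration, the runtime switch discussed above boils down to a tunable that accepts 0 or 1 and gates the feature. Below is a minimal Python sketch of that idea. The names FeatureSwitch and send_chunk are hypothetical and are not the interface from the fast read patch, which is kernel C code; zlib merely stands in for whatever compressor is used.<br>
<pre>
```python
import zlib


class FeatureSwitch:
    """Hypothetical model of a 0/1 tunable such as a proc file."""

    def __init__(self, enabled=True):
        self.enabled = enabled

    def write(self, text):
        """Handle a write such as `echo 0`: accept only "0" or "1"."""
        value = text.strip()
        if value not in ("0", "1"):
            raise ValueError("expected 0 or 1, got %r" % text)
        self.enabled = (value == "1")

    def read(self):
        """Return what reading the tunable would show."""
        return "1\n" if self.enabled else "0\n"


def send_chunk(data, switch, compress):
    """Compress only while the switch is on; otherwise pass data through."""
    return compress(data) if switch.enabled else data


if __name__ == "__main__":
    switch = FeatureSwitch()
    payload = b"a" * 4096
    print(len(send_chunk(payload, switch, zlib.compress)))  # compressed size
    switch.write("0\n")                                     # like `echo 0`
    print(len(send_chunk(payload, switch, zlib.compress)))  # original size
```
</pre>
Checking the switch on every chunk, rather than once at mount time, is what lets a user flip the feature while I/O is running and compare behaviour before and after.<br>
<br>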

> 2. Lustre clients iterate over the stripes, basically.<br>

> <br>

> Here's an explanation of the write path on the client that should<br>

> help.  This explanation is heavily simplified and incorrect in some<br>

> of the details, but should be accurate enough for your question.<br>

> The I/O model on the client (for buffered I/O, direct I/O is<br>

> different) is that the writing process (userspace process) starts an<br>

> I/O, then identifies which parts of the I/O go to which stripes, gets<br>

> the locks it needs, then copies the data through the page cache... <br>

> Once the data is copied to the page cache, Lustre then works on<br>

> writing out that data.  In general, it does it asynchronously, where<br>

> the userspace process returns and then data write-out is handled by<br>

> the ptlrpcd (daemon) threads, but in various exceptional conditions<br>

> it may do the write-out in the userspace process.<br>

> <br>

> In general, the write out is going to happen in parallel (to<br>

> different OSTs) with different ptlrpcd threads taking different<br>

> chunks of data and putting them on the wire, and sometimes the<br>

> userspace thread doing that work for some of the data as well.<br>

> <br>

> So "How much memory do we need at most at the same time?" is not a<br>

> question with an easy answer.  When doing a bulk RPC, generally, the<br>

> sender sends an RPC announcing the bulk data is ready, then the<br>

> recipient copies the data (RDMA) (or the sender sends it over to a<br>

> buffer if no RDMA) and announces to the client it has done so.  I'm<br>

> not 100% clear on the sequencing here, but the key thing is there's a<br>

> time where we've sent the RPC but we aren't done with the buffer.  So<br>

> we can send another RPC before that buffer is retired.  (If I've got<br>

> this badly wrong, I hope someone will correct me.)<br>

> <br>

> So the total amount of memory required to do this is going to depend<br>

> on how fast data is being sent, rather than on the # of OSTs or any<br>

> other constant.<br>

> <br>

> There *is* a per OST limit to how many RPCs a client can have in<br>

> flight at once, but it's generally set so the client can get good<br>

> performance to one OST.  Allocating data for max_rpcs_in_flight*num<br>

> OSTs would be far too much, because in the 1000 OST case, a client<br>

> can probably only have a few hundred RPCs in flight (if that...) at<br>

> once on a normal network.<br>

> <br>

> But if we are writing from one client to many OSTs, how many RPCs are<br>

> in flight at once is going to depend more on how fast our network is<br>

> (or, possibly, CPU on the client if the network is fast and/or CPU is<br>

> slow) than any explicit limits.  The explicit limits are much higher<br>

> than we will hit in practice.<br>

> <br>

> Does that make sense?  It doesn't make your problem any easier...<br>

<br>

Totally, and you are right, it is more complex than I hoped. <br>

<br>
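The sizing argument above can be made concrete with a back-of-the-envelope calculation: the data pinned in send buffers is roughly the send rate times the time a buffer stays in use, while the explicit per-OST limit gives a much larger bound. The sketch below is only an illustration. The function names are hypothetical, and every number (10 GiB/s, 10 ms hold time, 8 RPCs in flight, 1000 OSTs, 4 MiB per RPC) is an assumption, not a measurement.<br>
<pre>
```python
MIB = 1024 * 1024


def in_flight_bytes(bandwidth_bytes_per_s, buffer_hold_s):
    """Bytes pinned in send buffers at once: send rate times hold time."""
    return bandwidth_bytes_per_s * buffer_hold_s


def explicit_limit_bytes(max_rpcs_in_flight, num_osts, rpc_bytes):
    """Upper bound implied by the per-OST RPC limit alone."""
    return max_rpcs_in_flight * num_osts * rpc_bytes


# All numbers below are illustrative assumptions, not measurements.
estimate = in_flight_bytes(10 * 1024 * MIB, 0.01)  # 10 GiB/s, 10 ms hold time
limit = explicit_limit_bytes(8, 1000, 4 * MIB)     # 8 RPCs, 1000 OSTs, 4 MiB
effective = min(estimate, limit)                   # what is actually needed

if __name__ == "__main__":
    print("rate-based estimate: %.1f MiB" % (estimate / MIB))
    print("explicit limit:      %.1f MiB" % (limit / MIB))
    print("effective need:      %.1f MiB" % (effective / MIB))
```
</pre>
With these assumed values the rate-based estimate is roughly 100 MiB, against an explicit limit of about 31 GiB, which is why sizing a pool from max_rpcs_in_flight times the number of OSTs would allocate far too much.<br>
<br>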

> <br>

> It actually seems like maybe a global pool of pages *is* the right<br>

> answer.  The question is how big to make it...<br>

> What about making it grow on demand up to a configurable upper limit?<br>

> <br>

> The allocation code for encryption is here (it's pretty complicated<br>

> and it works on the assumption that it must get pages or return<br>

> ENOMEM - The compression code doesn't absolutely have to get pages,<br>

> so it could be changed):<br>

> sptlrpc_enc_pool_get_pages<br>

> <br>

> It seems like maybe that code could be adjusted to serve both the<br>

> encryption case (must not fail, if it can't get memory, return<br>

> -ENOMEM to cause retries), and the compression case (can fail, if it<br>

> fails, should not do compression...  Maybe should consume less<br>

> memory)<br>

<br>

Currently we are not very close to the sptlrpc layer and do not use any<br>

of the encryption structures (it was initially planned, but turned out<br>

differently). But we have already looked at those pools.<br>

<br>
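To illustrate the idea of one pool that grows on demand up to a configurable limit and serves both kinds of caller, here is a minimal Python sketch. It is not the sptlrpc_enc_pool_get_pages logic. The class PagePool, its methods, and the must_succeed flag are hypothetical names chosen for this example, and pages are modelled as plain bytearrays.<br>
<pre>
```python
import errno


class PagePool:
    """Hypothetical page pool that grows on demand up to max_pages."""

    def __init__(self, max_pages, page_size=4096):
        self.max_pages = max_pages  # configurable upper limit
        self.page_size = page_size
        self.free = []              # pages created earlier and returned
        self.allocated = 0          # pages ever created by this pool

    def get_pages(self, count, must_succeed):
        """Return count pages, or None if a may-fail request cannot be met."""
        shortfall = count - len(self.free)
        if shortfall > 0:
            if self.allocated + shortfall > self.max_pages:
                if must_succeed:
                    # Encryption-style caller: report ENOMEM so it can retry.
                    raise OSError(errno.ENOMEM, "page pool exhausted")
                return None  # compression-style caller: skip compression
            # Grow the pool by exactly the number of missing pages.
            self.free.extend(bytearray(self.page_size)
                             for _ in range(shortfall))
            self.allocated += shortfall
        return [self.free.pop() for _ in range(count)]

    def put_pages(self, pages):
        """Give pages back so later requests can reuse them."""
        self.free.extend(pages)


if __name__ == "__main__":
    pool = PagePool(max_pages=4)
    held = pool.get_pages(3, must_succeed=False)
    # Only one page of headroom is left, so a may-fail request for two
    # pages is refused and the caller would send the data uncompressed.
    print(pool.get_pages(2, must_succeed=False))
    pool.put_pages(held)
```
</pre>
Returning None in the may-fail case keeps the decision with the caller: a compression path can fall back to sending uncompressed data, while a must-succeed path gets an error it can retry on.<br>
<br>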

> <br>

> About thread counts:<br>

> Encryption is handled in the ptlrpc code, and your presentation noted<br>

> the plan is to mimic that, which sounds good to me.  That means<br>

> there's no reason for you to explicitly control the number of threads<br>

> doing compression, the same number of threads doing sending will be<br>

> doing compression, which seems fine.  (Unless there's some point of<br>

> contention in the compression code, but that seems unlikely...)<br>

<br>

We currently intervene before the request is created<br>

(osc_brw_prep_request), but we still don't do anything explicit with<br>

threads; we just give the existing ones some more work. Handling limited<br>

resources belongs more to the later phase, where we will optimize, tune, and<br>

introduce the adaptive part. <br>

<br>

> <br>

> Hope that helps a bit.<br>

<br>

It helps a lot! Thank you!<br>

<br>

Anna<br>

</div>

</span></font>

</body>

</html>