<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">


<br class="">


<div>


<blockquote type="cite" class="">


<div class="">On Feb 17, 2017, at 12:29 PM, Dilger, Andreas <<a href="mailto:andreas.dilger@intel.com" class="">andreas.dilger@intel.com</a>> wrote:</div>


<br class="Apple-interchange-newline">


<div class=""><span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">On


 Feb 17, 2017, at 12:15, Xiong, Jinshan <</span><a href="mailto:jinshan.xiong@intel.com" style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">jinshan.xiong@intel.com</a><span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">>


 wrote:</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<blockquote type="cite" style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">


<br class="">


Hi Anna,<br class="">


<br class="">


Thanks for updating. Please see inserted lines.<br class="">


<br class="">


<blockquote type="cite" class="">On Feb 16, 2017, at 6:15 AM, Anna Fuchs <<a href="mailto:anna.fuchs@informatik.uni-hamburg.de" class="">anna.fuchs@informatik.uni-hamburg.de</a>> wrote:<br class="">


<br class="">


Dear all,<span class="Apple-converted-space"> </span><br class="">


<br class="">


I would like to update you about my progress on the project.<span class="Apple-converted-space"> </span><br class="">


Unfortunately, I cannot publish a complete design of the feature,<br class="">


since it is changing considerably during development.<span class="Apple-converted-space"> </span><br class="">


<br class="">


First the work related to the client changes:<span class="Apple-converted-space"> </span><br class="">


<br class="">


I had to discard my approach to introduce the changes within the<br class="">


sptlrpc layer for the moment. Compression of the data affects<br class="">


especially the resulting number of pages and therefore number and size<br class="">


of niobufs, size and structure of the descriptor and request, size of<br class="">


the bulk kiov, checksums and in the end the async arguments. Actually<br class="">


it affects everything that is set within the osc_brw_prep_request<br class="">


function in osc_request.c. When entering the sptlrpc layer, most of<br class="">


those parameters are already set and I would need to update everything.<br class="">


That causes double work and requires a lot of code duplication from the<br class="">


osc module.<span class="Apple-converted-space"> </span><br class="">


<br class="">


My current dirty prototype invokes compression just at the beginning of<br class="">


that function, before niocount is calculated. I need to have a separate<br class="">


bunch of pages to store compressed data so that I would not overwrite<br class="">


the content of the original pages, which may be exposed to the<br class="">


userspace process.<span class="Apple-converted-space"> </span><br class="">


The original pages would be freed and the compressed pages processed<br class="">


for the request and finally also freed.<span class="Apple-converted-space"> </span><br class="">


</blockquote>


<br class="">


Please remember to reserve some pages as an emergency pool, so that when system memory is short, compression can still obtain free pages and write back more dirty pages. We may use the same pool to support partial blocks, so it must be larger than the largest ZFS block size (although I would prefer not to compress data for partial blocks).<span class="Apple-converted-space"> </span><br class="">


<br class="">


After the RPC is issued, the pages containing compressed data will stay pinned in memory for a while for recovery reasons. Therefore, when emergency pages are used, you will have to issue the RPC in sync mode, so that the server commits the write transaction to persistent storage and the client can reuse the emergency pages for a new RPC immediately.<br class="">
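<br class="">
A minimal userspace sketch of such an emergency pool, under stated assumptions: the 4 KiB page size, the 1 MiB maximum chunk size, and all names (emerg_pool, comp_buf, rpc_must_be_sync) are hypothetical and not existing Lustre code. It only illustrates the rule above: the pool is sized larger than one chunk, and any RPC that holds a pool page is marked for sync mode.<br class="">
<br class="">

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ    4096u                      /* assumed page size */
#define MAX_CHUNK  (1024u * 1024u)            /* assumed largest chunk (ZFS block) size */
#define POOL_PAGES (MAX_CHUNK / PAGE_SZ + 1)  /* strictly larger than one chunk */

/* Hypothetical emergency pool: pages set aside at setup time. */
struct emerg_pool {
    void  *pages[POOL_PAGES];
    size_t nfree;
};

/* A page holding compressed data, tagged with where it came from. */
struct comp_buf {
    void *page;
    bool  from_pool;
};

/* Reserve every pool page up front; returns 0 on success, -1 on failure. */
int pool_init(struct emerg_pool *p)
{
    assert(p != NULL);
    for (p->nfree = 0; p->nfree < POOL_PAGES; p->nfree++) {
        p->pages[p->nfree] = malloc(PAGE_SZ);
        if (p->pages[p->nfree] == NULL) {
            while (p->nfree > 0)
                free(p->pages[--p->nfree]);
            return -1;
        }
    }
    return 0;
}

/* Release the reserved pages; all buffers must have been returned first. */
void pool_fini(struct emerg_pool *p)
{
    while (p->nfree > 0)
        free(p->pages[--p->nfree]);
}

/*
 * Get a page for compressed data. `normal_ok` stands in for "the regular
 * allocator may be used"; pass false to simulate memory pressure.
 */
struct comp_buf comp_buf_get(struct emerg_pool *p, bool normal_ok)
{
    struct comp_buf b = { NULL, false };

    if (normal_ok)
        b.page = malloc(PAGE_SZ);
    if (b.page == NULL && p->nfree > 0) {
        b.page = p->pages[--p->nfree];
        b.from_pool = true;
    }
    return b;
}

/* Return a page: pool pages go back to the pool, others are freed. */
void comp_buf_put(struct emerg_pool *p, struct comp_buf b)
{
    if (b.from_pool)
        p->pages[p->nfree++] = b.page;
    else
        free(b.page);
}

/* An RPC that uses any emergency page should be sent in sync mode. */
bool rpc_must_be_sync(const struct comp_buf *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bufs[i].from_pool)
            return true;
    return false;
}
```

<br class="">
In this sketch the pool page becomes reusable as soon as comp_buf_put() is called, which in the real client would only happen after the sync RPC has been committed by the server.<br class="">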


<br class="">


<blockquote type="cite" class=""><br class="">


I also reconsidered the idea to do compression niobuf-wise. Due to the<br class="">


file layout, compression should be done record-wise. Since a niobuf is<br class="">


a technical requirement for the pages to be contiguous, a record (e.g.<br class="">


128KB) is a logical unit. In my understanding, it can happen that one<br class="">


record consists of several niobufs whenever we do not have enough<br class="">


</blockquote>


<br class="">


We use the term ‘chunk’ for the preferred block size on the OST. Let’s use the same terminology ;-)<br class="">


<br class="">


<blockquote type="cite" class="">contiguous pages for a complete record. For that reason, I would like<br class="">


to leave the niobuf structure as it is and introduce a record structure<br class="">


on top of it. That record structure will hold the logical (uncompressed)<br class="">


and physical (compressed) data sizes and the algorithm used for<br class="">


</blockquote>


<br class="">


Hmm… I'm not sure this is the right approach. I tend to think the client should talk with the OST at connect time and negotiate the compression algorithm, and after that they should keep using the same algorithm. There is no need to carry this information in every single


 RPC.<br class="">
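<br class="">
One way to picture the connect-time negotiation, as a sketch only: each side advertises a bitmask of algorithms it supports, and the first entry of a preference list found in both masks is used from then on. The COMP_* bits and comp_negotiate() are hypothetical names for illustration, not existing Lustre connect flags.<br class="">
<br class="">

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical algorithm bits; not existing Lustre connect flags. */
#define COMP_NONE 0u
#define COMP_LZ4  (1u << 0)
#define COMP_GZIP (1u << 1)

/*
 * Pick the first algorithm in `pref` (most preferred first) that both
 * sides advertise; fall back to COMP_NONE when there is no overlap.
 */
uint32_t comp_negotiate(uint32_t client_mask, uint32_t server_mask,
                        const uint32_t *pref, size_t npref)
{
    uint32_t common = client_mask & server_mask;

    assert(pref != NULL || npref == 0);
    for (size_t i = 0; i < npref; i++)
        if (common & pref[i])
            return pref[i];
    return COMP_NONE;
}
```

<br class="">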


</blockquote>


<br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">I'm


 not sure I agree.  The benefits of compression may be different on a per-file basis (e.g. .txt vs. .jpg) so there shouldn't be a fixed compression algorithm required for all RPCs.  I could imagine that we don't want to allow a different compression type for


 each block (which ZFS allows), but one compression type per RPC should be OK.  We do the same for the checksum type.</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


</div>


</blockquote>


<div><br class="">


</div>


<div>The difference between checksum and compression is that different checksum types should produce the same end result, so clients can pick any checksum algorithm at their own discretion.</div>


<div><br class="">


</div>


<div>As for your example, I think it’s more likely that the OSC will decide to turn off compression for the .jpg file after trying to compress a few chunks and finding that there is no benefit in doing so.</div>
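<div><br class=""></div>
<div>A rough sketch of that heuristic, assuming per-file statistics kept by the client: after a few probe chunks, compression is switched off if the cumulative saving stays below a threshold. The struct, the function name, and both thresholds (4 chunks, 10%) are made up for illustration.</div>
<div><br class=""></div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COMP_PROBE_CHUNKS   4u   /* assumed: chunks to try before deciding */
#define COMP_MIN_SAVING_PCT 10u  /* assumed: minimum saving worth keeping */

/* Hypothetical per-file compression statistics. */
struct comp_stats {
    uint64_t in_bytes;      /* uncompressed bytes seen so far */
    uint64_t out_bytes;     /* compressed bytes produced so far */
    unsigned chunks_tried;  /* chunks compressed so far */
    bool     disabled;      /* true once compression is switched off */
};

/*
 * Record one compressed chunk. Once at least COMP_PROBE_CHUNKS chunks have
 * been tried, disable compression if the cumulative saving is below
 * COMP_MIN_SAVING_PCT percent.
 */
void comp_stats_update(struct comp_stats *s, size_t lsize, size_t psize)
{
    assert(s != NULL);
    if (s->disabled)
        return;
    s->in_bytes += lsize;
    s->out_bytes += psize;
    s->chunks_tried++;
    if (s->chunks_tried >= COMP_PROBE_CHUNKS &&
        s->out_bytes * 100 > s->in_bytes * (100 - COMP_MIN_SAVING_PCT))
        s->disabled = true;
}
```
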


<div><br class="">


</div>


<div>Jinshan</div>


<br class="">


<blockquote type="cite" class="">


<div class=""><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<blockquote type="cite" style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">


Yes, it’s reasonable to have chunk descriptors in the RPC. When there are multiple compressed chunks packed in one RPC, the exact buffer size of each chunk will be packed as well. Right now, LNET doesn’t support partial pages inside a niobuf (except the first and last page), so clients have to provide enough information in the chunk descriptor for the server to deduce the padding size of each chunk in the niobuf.<br class="">
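<br class="">
To illustrate how the server could deduce the padding, here is a small sketch. It assumes that every compressed chunk starts on a page boundary in the bulk and is padded to the end of its last page, and that pages are 4 KiB; struct chunk_descr and its field names are hypothetical.<br class="">
<br class="">

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size */

/* Hypothetical chunk descriptor; field names are illustrative. */
struct chunk_descr {
    uint32_t cd_lsize;  /* logical (uncompressed) size in bytes */
    uint32_t cd_psize;  /* physical (compressed) size in bytes */
};

/* Bytes the chunk occupies in the bulk once rounded up to whole pages. */
uint32_t chunk_padded_size(const struct chunk_descr *cd)
{
    assert(cd != NULL);
    return (cd->cd_psize + PAGE_SZ - 1) / PAGE_SZ * PAGE_SZ;
}

/* Padding the server must skip after the chunk's compressed bytes. */
uint32_t chunk_padding(const struct chunk_descr *cd)
{
    return chunk_padded_size(cd) - cd->cd_psize;
}

/* Byte offset of chunk `idx` in the bulk, given all earlier descriptors. */
uint64_t chunk_bulk_offset(const struct chunk_descr *cds, size_t idx)
{
    uint64_t off = 0;

    for (size_t i = 0; i < idx; i++)
        off += chunk_padded_size(&cds[i]);
    return off;
}
```

<br class="">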


<br class="">


<blockquote type="cite" class="">compression. Initially we wanted to extend the niobuf struct by those<br class="">


fields. I think that change would affect the RPC request structure very<br class="">


much since the first Lustre message fields will not be followed by an<br class="">


array of niobufs, but by an array of records, which can contain an<br class="">


array of niobufs.<span class="Apple-converted-space"> </span><br class="">


</blockquote>


<br class="">


We just need a new RPC format. Please take a look at RQF_OST_BRW_{READ,WRITE}. What we need is probably something like RQF_OST_COMP_BRW_{READ,WRITE}, which is basically the same thing but with a chunk descriptor:<br class="">


<br class="">


static const struct req_msg_field *ost_comp_brw_client[] = {<br class="">


       &RMF_PTLRPC_BODY,<br class="">


       &RMF_OST_BODY,<br class="">


       &RMF_OBD_IOOBJ,<br class="">


       &RMF_NIOBUF_REMOTE,<br class="">


       &RMF_CHUNK_DESCR, /* new field: chunk descriptor */<br class="">


       &RMF_CAPA1<br class="">


};<br class="">


<br class="">


<blockquote type="cite" class="">On the server/storage side, the different niobufs must be then<br class="">


associated with the same record and provided to ZFS.<span class="Apple-converted-space"> </span><br class="">


<br class="">


Server changes:<span class="Apple-converted-space"> </span><br class="">


<br class="">


Since we work on the Lustre/ZFS interface, we think it would be the<br class="">


best to let Lustre compose the header information for every record<br class="">


(psize and algorithm, maybe also the checksum in the future). We will<br class="">


</blockquote>


<br class="">


I would rather let ZFS do this job, especially for the checksum; otherwise, if Lustre provided wrong data, it would affect the consistency of ZFS.<br class="">


</blockquote>


<br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">We


 want to allow Lustre clients to use the same ZFS checksum in the future, so there needs to be an interface to pass this.  If ZFS verifies the checksum when the write is first submitted, and returns an error before doing actual filesystem modifications then


 it can verify the checksum is correct for that block, and we can skip the Lustre RPC checksum.  This would probably work OK with the "zero copy" interface that we use, where data buffers are preallocated for RDMA without actually being attached to a TXG, and


 then the checksum would be verified by ZFS at submission.</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<blockquote type="cite" style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">


<blockquote type="cite" class="">store these values at the beginning of every record, in 4 bytes each.<span class="Apple-converted-space"> </span><br class="">


Currently, when ZFS does compression itself, the compressed size is<br class="">


stored only within the compressed data. Some algorithms get it when<br class="">


starting the decompression, for lz4 it is stored at the beginning. With<br class="">


our approach, we would unify the record-metadata for any algorithm, but<br class="">


</blockquote>


<br class="">


Wait, are you suggesting storing the record/chunk metadata in persistent storage?<br class="">


<br class="">


<blockquote type="cite" class="">at the moment it would not be accessible by ZFS without changes to ZFS<br class="">


structures.<span class="Apple-converted-space"> </span><br class="">


<br class="">


ZFS will also hold an extra variable indicating whether the data is compressed at<br class="">


all. When reading compressed data, it is up to Lustre to get<br class="">


the original size and algorithm, to decompress the data and put it into<br class="">


page structure.<span class="Apple-converted-space"> </span><br class="">


</blockquote>


<br class="">


Yes, the server will check the client's capabilities to decide whether to return compressed data.<br class="">


<br class="">


I haven't looked into the corresponding code, but Matt mentioned before that this is pretty much the same interface as ZFS send/recv.<br class="">


<br class="">


Thanks,<br class="">


Jinshan<br class="">


<br class="">


<blockquote type="cite" class=""><br class="">


<br class="">


Any comments or ideas are very welcome!<span class="Apple-converted-space"> </span><br class="">


<br class="">


Regards,<br class="">


Anna<br class="">


<br class="">


<br class="">


<br class="">


<br class="">


<br class="">


_______________________________________________<br class="">


lustre-devel mailing list<br class="">


<a href="mailto:lustre-devel@lists.lustre.org" class="">lustre-devel@lists.lustre.org</a><br class="">


http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org<br class="">


</blockquote>


</blockquote>


<br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Cheers,


 Andreas</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">--</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Andreas


 Dilger</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Lustre


 Principal Architect</span><br style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">


<span style="font-family: Helvetica; font-size: 15px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Intel


 Corporation</span></div>


</blockquote>


</div>


<br class="">


</body>


</html>