-------- Original Message --------
Subject: Full stripe write in RAID6
Date:    Mon, 18 Aug 2014 15:57:25 +0530
From:    aayush agrawal <aayush.agrawal@calsoftinc.com>
To:      lustre-devel@lists.lustre.org

Hi,

I am using Lustre 2.5.0 and the corresponding kernel, 2.6.32-358.
In addition to the default patches that come with Lustre 2.5.0, I applied
the following patches to kernel 2.6.32-358:

raid5-configurable-cachesize-rhel6.patch
raid5-large-io-rhel5.patch
raid5-stats-rhel6.patch
raid5-zerocopy-rhel6.patch
raid5-mmp-unplug-dev-rhel6.patch
raid5-mmp-unplug-dev.patch
raid5-maxsectors-rhel5.patch
raid5-stripe-by-stripe-handling-rhel6.patch

I took all of the above patches from:
https://github.com/Xyratex/lustre-stable/tree/b_neo_1.4.0/lustre/kernel_patches/patches
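
For the record, this is roughly how I applied them to the kernel source tree
(just a sketch: the paths below are from my setup, and the loop simply mirrors
the order listed above; the kernel_patches series file is the authoritative
ordering):

cd /usr/src/linux-2.6.32-358
for p in raid5-configurable-cachesize-rhel6.patch raid5-large-io-rhel5.patch \
         raid5-stats-rhel6.patch raid5-zerocopy-rhel6.patch \
         raid5-mmp-unplug-dev-rhel6.patch raid5-mmp-unplug-dev.patch \
         raid5-maxsectors-rhel5.patch raid5-stripe-by-stripe-handling-rhel6.patch
do
    patch -p1 < /root/kernel_patches/patches/$p    # local copy of the patch dir
done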

My question is: if I write an entire stripe, does the RAID6 md driver still
need to read any blocks from the underlying devices?
I am asking this question on the Lustre mailing list because I have seen
that the Lustre community has changed the RAID driver a lot.

I have created a RAID6 device with the default (512K) chunk size across
6 devices in total.
# cat /sys/block/md127/queue/optimal_io_size
2097152
I believe this is the full stripe size (512K * 4 data disks).
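
For reference, this is how I cross-checked the geometry (a quick sketch; the
md sysfs attribute reports the chunk size in bytes):

# mdadm --detail /dev/md127 | egrep 'Level|Raid Devices|Chunk Size'
# cat /sys/block/md127/md/chunk_size
524288
# echo $((524288 * (6 - 2)))        # chunk size * number of data disks
2097152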

If I write 2MB of data, I expect to dirty the entire stripe, so I believe
none of the data or parity blocks should need to be read, thus avoiding the
RAID6 penalties.
Does the md/raid driver support full stripe writes that avoid the RAID6
penalties?
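
To check that, my plan is to read the per-member counters once before and once
after a single 2MB write and compare (a sketch; in the stat file, field 1 is
reads completed and field 3 is sectors read, and the device names are from my
array):

for d in sdah1 sdai1 sdaj1 sdcg1 sdch1 sdci1
do
    # print reads completed and sectors read for each member partition
    echo -n "$d: "
    awk '{print "reads=" $1, "sectors_read=" $3}' /sys/class/block/$d/stat
done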

I also expected each of the 6 disks to receive a 512K write (4 data disks
+ 2 parity disks).
However, when I do I/O directly on the block device /dev/md127, I do observe
reads on the md device and on the underlying member devices as well.

# mdstat output:
md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] sdci1[0]
      41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

# raw -qa
/dev/raw/raw1: bound to major 9, minor 127
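
(For completeness, the raw binding above was created with something like the
following; major 9, minor 127 corresponds to /dev/md127:)

# raw /dev/raw/raw1 /dev/md127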

# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)
(I also tried of=/dev/md127 oflag=direct, with the same results.)

# iostat shows:
Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1     7.00         0.00       205.20          0       1026
sdai1     6.20         0.00       205.20          0       1026
sdah1     9.80         0.00       246.80          0       1234
sdcg1     6.80         0.00       205.20          0       1026
sdci1     9.60         0.00       246.80          0       1234
sdch1     6.80         0.00       205.20          0       1026
md127     0.80         0.00       819.20          0       4096
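
If I read the iostat numbers correctly (the Blk columns are 512-byte sectors
by default), the totals do line up with a single full-stripe write plus a
little metadata:

# echo $((4096 * 512))              # md127: exactly the 2 MiB I wrote
2097152
# echo $((1024 * 512))              # per member: one 512 KiB chunk
524288
(the extra 2 or 210 sectors per member I assume are superblock or other
metadata updates)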

I assume that if I perform writes in multiples of "optimal_io_size" I would
be doing full stripe writes, thus avoiding reads.
Unfortunately, with two 2M writes I do see reads happening on some of these
drives.
The same is true for count=4 or count=6 (equal to the number of data disks
or the total number of disks).

# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1    13.40       204.80       410.00       1024       2050
sdai1    11.20         0.00       410.00          0       2050
sdah1    15.80         0.00       464.40          0       2322
sdcg1    13.20       204.80       410.00       1024       2050
sdci1    16.60         0.00       464.40          0       2322
sdch1    12.40       192.00       410.00        960       2050
md127     1.60         0.00      1638.40          0       8192
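
To see what those member reads actually are (stripe pre-reads versus md
metadata), I intend to trace one member while repeating the test; a rough
sketch, assuming blktrace is installed:

# blktrace -d /dev/sdaj1 -o sdaj1_trace -w 30 &
# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)
# wait
# blkparse -i sdaj1_trace | grep -w R | head     # look at the read requests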

I believe the RAID6 penalties will exist for random writes, but for
sequential writes, will they still exist in some other form in the Linux
md/raid driver?
My aim is to maximize the RAID6 write I/O rate with sequential writes,
without incurring the RAID6 penalties.
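
For completeness, these are the knobs I am looking at while chasing this;
the values are only guesses on my part, not recommendations:

# echo 4096 > /sys/block/md127/md/stripe_cache_size
  (stripe heads per member device; the default is 256, 4096 is just a guess)
# cat /sys/block/md127/md/stripe_cache_active
  (how many stripes are in use while the writes are in flight)
# dd if=/dev/zero of=/dev/md127 bs=2M count=16 oflag=direct
  (writes sized and aligned to the 2 MiB full stripe)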

Please correct me wherever my assumptions are wrong, and let me know whether
any other configuration parameter (for the block device or the md device) is
required to achieve this.

Thanks,
Aayush