-------- Original Message --------
Subject: Full stripe write in RAID6
Date:    Mon, 18 Aug 2014 15:57:25 +0530
From:    aayush agrawal <aayush.agrawal@calsoftinc.com>
To:      lustre-devel@lists.lustre.org

Hi,

I am using Lustre 2.5.0 and the corresponding kernel, 2.6.32-358.
In addition to the default patches that come with Lustre 2.5.0, I applied
the following patches to kernel 2.6.32-358:

raid5-configurable-cachesize-rhel6.patch
raid5-large-io-rhel5.patch
raid5-stats-rhel6.patch
raid5-zerocopy-rhel6.patch
raid5-mmp-unplug-dev-rhel6.patch
raid5-mmp-unplug-dev.patch
raid5-maxsectors-rhel5.patch
raid5-stripe-by-stripe-handling-rhel6.patch

I took all of the above patches from:
https://github.com/Xyratex/lustre-stable/tree/b_neo_1.4.0/lustre/kernel_patches/patches
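
For the record, this is roughly how I applied them to the kernel source tree
(just a sketch: the paths below are from my setup, and the loop simply mirrors
the order listed above; the kernel_patches series file is the authoritative
ordering):

cd /usr/src/linux-2.6.32-358
for p in raid5-configurable-cachesize-rhel6.patch raid5-large-io-rhel5.patch \
         raid5-stats-rhel6.patch raid5-zerocopy-rhel6.patch \
         raid5-mmp-unplug-dev-rhel6.patch raid5-mmp-unplug-dev.patch \
         raid5-maxsectors-rhel5.patch raid5-stripe-by-stripe-handling-rhel6.patch
do
    patch -p1 < /root/kernel_patches/patches/$p    # local copy of the patch dir
done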

My question is: if I write an entire stripe, does the RAID6 md driver still
need to read any blocks from the underlying devices?
I am asking this question on the Lustre mailing list because I have seen
that the Lustre community has changed the RAID driver a lot.

I have created a RAID6 device with the default (512K) chunk size across
6 devices in total.
# cat /sys/block/md127/queue/optimal_io_size
2097152
I believe this is the full stripe size (512K * 4 data disks).
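
For reference, this is how I cross-checked the geometry (a quick sketch; the
md sysfs attribute reports the chunk size in bytes):

# mdadm --detail /dev/md127 | egrep 'Level|Raid Devices|Chunk Size'
# cat /sys/block/md127/md/chunk_size
524288
# echo $((524288 * (6 - 2)))        # chunk size * number of data disks
2097152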

If I write 2MB of data, I expect to dirty the entire stripe, so I believe
none of the data or parity blocks should need to be read, thus avoiding the
RAID6 penalties.
Does the md/raid driver support full stripe writes that avoid the RAID6
penalties?
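
To check that, my plan is to read the per-member counters once before and once
after a single 2MB write and compare (a sketch; in the stat file, field 1 is
reads completed and field 3 is sectors read, and the device names are from my
array):

for d in sdah1 sdai1 sdaj1 sdcg1 sdch1 sdci1
do
    # print reads completed and sectors read for each member partition
    echo -n "$d: "
    awk '{print "reads=" $1, "sectors_read=" $3}' /sys/class/block/$d/stat
done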

I also expected each of the 6 disks to receive a 512K write (4 data disks
+ 2 parity disks).
However, when I do I/O directly on the block device /dev/md127, I do observe
reads on the md device and on the underlying member devices as well.

# mdstat output:
md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] sdci1[0]
      41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

# raw -qa
/dev/raw/raw1: bound to major 9, minor 127
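
(For completeness, the raw binding above was created with something like the
following; major 9, minor 127 corresponds to /dev/md127:)

# raw /dev/raw/raw1 /dev/md127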

# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)
(I also tried of=/dev/md127 oflag=direct, with the same results.)

# iostat shows:
Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1     7.00         0.00       205.20          0       1026
sdai1     6.20         0.00       205.20          0       1026
sdah1     9.80         0.00       246.80          0       1234
sdcg1     6.80         0.00       205.20          0       1026
sdci1     9.60         0.00       246.80          0       1234
sdch1     6.80         0.00       205.20          0       1026
md127     0.80         0.00       819.20          0       4096
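
If I read the iostat numbers correctly (the Blk columns are 512-byte sectors
by default), the totals do line up with a single full-stripe write plus a
little metadata:

# echo $((4096 * 512))              # md127: exactly the 2 MiB I wrote
2097152
# echo $((1024 * 512))              # per member: one 512 KiB chunk
524288
(the extra 2 or 210 sectors per member I assume are superblock or other
metadata updates)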

I assume that if I perform writes in multiples of "optimal_io_size" I would
be doing full stripe writes, thus avoiding reads.
Unfortunately, with two 2M writes I do see reads happening on some of these
drives.
The same is true for count=4 or count=6 (equal to the number of data disks
or the total number of disks).

# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1    13.40       204.80       410.00       1024       2050
sdai1    11.20         0.00       410.00          0       2050
sdah1    15.80         0.00       464.40          0       2322
sdcg1    13.20       204.80       410.00       1024       2050
sdci1    16.60         0.00       464.40          0       2322
sdch1    12.40       192.00       410.00        960       2050
md127     1.60         0.00      1638.40          0       8192
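
To see what those member reads actually are (stripe pre-reads versus md
metadata), I intend to trace one member while repeating the test; a rough
sketch, assuming blktrace is installed:

# blktrace -d /dev/sdaj1 -o sdaj1_trace -w 30 &
# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)
# wait
# blkparse -i sdaj1_trace | grep -w R | head     # look at the read requests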

I believe the RAID6 penalties will exist for random writes, but for
sequential writes, will they still exist in some other form in the Linux
md/raid driver?
My aim is to maximize the RAID6 write I/O rate with sequential writes,
without incurring the RAID6 penalties.
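
For completeness, these are the knobs I am looking at while chasing this;
the values are only guesses on my part, not recommendations:

# echo 4096 > /sys/block/md127/md/stripe_cache_size
  (stripe heads per member device; the default is 256, 4096 is just a guess)
# cat /sys/block/md127/md/stripe_cache_active
  (how many stripes are in use while the writes are in flight)
# dd if=/dev/zero of=/dev/md127 bs=2M count=16 oflag=direct
  (writes sized and aligned to the 2 MiB full stripe)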

Please correct me wherever my assumptions are wrong, and let me know whether
any other configuration parameter (for the block device or the md device) is
required to achieve this.

Thanks,
Aayush