<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi Eric,<br>
<br>
Thanks very much for your comparison of the results. I would like to
explain the results a bit further:<br>
<br>
1) I suspect that the complex CLIO lock state machine and the tedious
iteration/repeat operations hurt the performance of traversing
large-striped directories; that is, the overhead introduced by those
factors is higher than in the original b1_8 I/O stack. For measuring
per-stripe overhead, it is unfair to compare the results
between patched lustre-2.x and lustre-1.8, because my AGL-related
patches are asynchronous pipelined operations and hide much of that
overhead, whereas b1_8 uses synchronous glimpse without prefetching. If
you compare original lustre-2.x with lustre-1.8, you will see the
overhead difference. In fact, the difference is also visible
in your second graph, just as you said: "<span style="color:
rgb(31, 73, 125);">1.8 gets better with more stripes, patched 2.x
gets worse"</span>.<br>
<br>
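As a rough illustration of what "per-stripe overhead" means here, the
sketch below takes (stripe count, per-file time) pairs and computes the
least-squares slope, ignoring the 0-stripe point as in your second
graph. The function name and the sample numbers are placeholders I made
up for illustration; they are not measured results.<br>
<pre>
```python
def per_stripe_overhead(samples):
    """Least-squares slope of per-file stat time against stripe count."""
    # Drop the 0-stripe point: only the incremental per-stripe cost matters.
    pts = [(s, t) for s, t in samples if s != 0]
    n = len(pts)
    mean_s = sum(s for s, _ in pts) / n
    mean_t = sum(t for _, t in pts) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in pts)
    var = sum((s - mean_s) ** 2 for s, _ in pts)
    return cov / var


# Placeholder values only (NOT measurements): (stripe count, per-file time).
example = [(0, 5.0), (1, 12.0), (2, 14.0), (4, 18.0), (8, 26.0)]
slope = per_stripe_overhead(example)
```
</pre>
Running the same calculation on original lustre-2.x, patched lustre-2.x
and lustre-1.8 data would give three slopes that can be compared
directly.<br>
<br>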
2) Currently, the limit on the number of in-flight AGL RPCs is the
statahead window. Originally, this window was only used for controlling
MDS-side statahead. That means that as soon as an item's MDS-side
attributes are ready (prefetched), the related OSS-side AGL RPC can be
triggered. The default statahead window size is 32, and my tests used the
default value. I also tested with a larger window size on Toro, but it
did not help much. I am not sure whether it would do better when
tested on more powerful nodes/network.<br>
<br>
3) For large-striped directories, the test results may not represent
real cases, because in my test there are 8 OSTs on each OSS,
but each OSS has only a 4-core CPU, which is much slower than the client
node (24-core CPU). I found that the OSS load was quite high for the
32-striped cases. In theory, there are at most 32 * 8 concurrent AGL RPCs
for each OSS. If we can test on more powerful OSS nodes for large-striped
directories, the improvement may be better than in the current results.<br>
<br>
4) If the OSS is the performance bottleneck, that can also explain "<span
style="color: rgb(31, 73, 125);">1.8 gets better with more
stripes, patched 2.x gets worse"</span> to some degree. For
b1_8, the glimpse RPCs for successive items are synchronous, so there are
at most 8 concurrent glimpse RPCs for each OSS, which means less
contention and therefore less overhead caused by contention. This is just
a guess based on my experience of studying SMP scaling.<br>
<br>
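To make the arithmetic in 3) and 4) explicit, here is a minimal sketch of
the upper bounds. It assumes that stripes are spread evenly over the
OSSes and that each stripe needs one glimpse RPC; the function and
parameter names are hypothetical and only for illustration.<br>
<pre>
```python
import math


def max_glimpse_rpcs_per_oss(items_in_flight, stripe_count,
                             oss_count, osts_per_oss):
    """Upper bound on concurrent glimpse RPCs that can hit one OSS."""
    # Assumption: stripes are spread evenly over the OSSes, one RPC per
    # stripe, and a file has at most one stripe on any given OST.
    stripes_per_oss = min(math.ceil(stripe_count / oss_count), osts_per_oss)
    return items_in_flight * stripes_per_oss


# Patched 2.x: up to a full statahead window (32 items) in flight.
agl_bound = max_glimpse_rpcs_per_oss(32, 32, 4, 8)
# b1_8: synchronous glimpse, so only one item is in flight at a time.
sync_bound = max_glimpse_rpcs_per_oss(1, 32, 4, 8)
```
</pre>
For the 32-striped case on 4 OSSes with 8 OSTs each, this gives 32 * 8 =
256 for AGL against 8 for synchronous glimpse.<br>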
<br>
Cheers,<br>
--<br>
Nasf<br>
<br>
On 5/26/11 9:01 PM, Eric Barton wrote:
<blockquote cite="mid:012401cc1ba4$fc090da0$f41b28e0$@com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<meta name="Generator" content="Microsoft Word 12 (filtered
medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";
color:black;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0cm;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";
color:black;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Times New Roman","serif";
color:#1F497D;}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";
color:black;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Nasf,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Interesting
results. Thank you - especially for graphing the results so
thoroughly.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">I’m
attaching them here and cc-ing lustre-devel since these are
of general interest.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">I
don’t think your conclusion number (1), to say CLIO locking
is slowing us down<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">is
as obvious from these results as you imply. If you just
compare the 1.8 and<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">patched
2.x per-file times and how they scale with #stripes you get
this…<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><img
id="Chart_x0020_3"
src="cid:part1.07070302.03060502@whamcloud.com"
height="461" width="668"></span><span style="color:
rgb(31, 73, 125);"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">The
gradients of these lines should correspond to the additional
time per stripe required<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">to
stat each file and I’ve graphed these times below (ignoring
the 0-stripe data for this<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">calculation
because I’m just interested in the incremental per-stripe
overhead).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><img
id="Chart_x0020_5"
src="cid:part2.08010704.06050106@whamcloud.com"
height="371" width="668"></span><span style="color:
rgb(31, 73, 125);"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">They
show per-stripe overhead for 1.8 well above patched 2.x for
the lower stripe<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">counts,
but whereas 1.8 gets better with more stripes, patched 2.x
gets worse. I’m<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">guessing
that at high stripe counts, 1.8 puts many concurrent
glimpses on the wire<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">and
does it quite efficiently. I’d like to understand better
how you control the #<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">of
glimpse-aheads you keep on the wire – is it a single fixed
number, or a fixed<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">number
per OST or some other scheme? In any case, it will be
interesting to see<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color: rgb(31, 73, 125);">measurements
at higher stripe counts.<o:p></o:p></span></p>
<blockquote style="margin-top: 5pt; margin-bottom: 5pt;">
<p class="MsoNormal" style=""><span style="color: rgb(31, 73,
125);" lang="EN-US">Cheers, <br>
Eric <o:p></o:p></span></p>
</blockquote>
<div style="border-width: medium medium medium 1.5pt;
border-style: none none none solid; border-color:
-moz-use-text-color -moz-use-text-color -moz-use-text-color
blue; padding: 0cm 0cm 0cm 4pt;">
<div>
<div style="border-right: medium none; border-width: 1pt
medium medium; border-style: solid none none;
border-color: rgb(181, 196, 223) -moz-use-text-color
-moz-use-text-color; padding: 3pt 0cm 0cm;">
<p class="MsoNormal"><b><span style="font-size: 10pt;
font-family:
"Tahoma","sans-serif"; color:
windowtext;" lang="EN-US">From:</span></b><span
style="font-size: 10pt; font-family:
"Tahoma","sans-serif"; color:
windowtext;" lang="EN-US"> Fan Yong
[<a class="moz-txt-link-freetext" href="mailto:yong.fan@whamcloud.com">mailto:yong.fan@whamcloud.com</a>] <br>
<b>Sent:</b> 12 May 2011 10:18 AM<br>
<b>To:</b> Eric Barton<br>
<b>Cc:</b> Bryon Neitzel; Ian Colle; Liang Zhen<br>
<b>Subject:</b> New test results for "ls -Ul"<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-bottom: 12pt;">I have
improved the statahead load-balancing mechanism to distribute
statahead load across more CPUs on the client, and adjusted AGL
according to the CLIO lock state machine. After those
improvements, 'ls -Ul' runs faster than with the old patches,
especially on large SMP nodes.<br>
<br>
On the other hand, as the degree of
parallelism increases, the lower-level network scheduler becomes the
performance bottleneck. So I combined my patches
with Liang's SMP patches in the test.<o:p></o:p></p>
<table class="MsoNormalTable" style="width: 100%;" border="1"
cellpadding="0" width="100%">
<tbody>
<tr>
<td style="padding: 1.5pt;" valign="top"><br>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">client (fat-intel-4, 24 cores)<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">server (client-xxx, 4 OSSes, 8
OSTs on each OSS)<o:p></o:p></p>
</td>
</tr>
<tr>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">b2x_patched<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">my patches + SMP patches<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">my patches<o:p></o:p></p>
</td>
</tr>
<tr>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">b18<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">original b1_8<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">share the same server with
"b2x_patched"<o:p></o:p></p>
</td>
</tr>
<tr>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">b2x_original<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">original b2_x<o:p></o:p></p>
</td>
<td style="padding: 1.5pt;" valign="top">
<p class="MsoNormal">original b2_x<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><br>
Some notes:<br>
<br>
1) Stripe count strongly affects traversal performance, and the
impact is more than linear. Even with all the patches
applied on b2_x, the impact of stripe count is still
larger than on b1_8. This is related to the complex CLIO lock
state machine and tedious iteration/repeat operations. It is
not easy to make it run as efficiently as b1_8.<br>
<br>
2) Patched b2_x is much faster than original b2_x: for
traversing a 400K * 32-striped directory, it is 100 times or
more faster.<br>
<br>
3) Patched b2_x is also faster than b1_8: in our tests,
patched b2_x is at least 4X faster than b1_8, which meets
the requirement in the ORNL contract.<br>
<br>
4) Original b2_x is faster than b1_8 only for small-striped
cases (no more than 4 stripes). For large-striped cases it is
slower than b1_8, which is consistent with the ORNL test results.<br>
<br>
5) The largest stripe count in our test is 32. We do not have
enough resources to test larger stripe counts. I also
wonder whether it is worth testing larger-striped
directories at all: how many customers want to use
large, fully striped directories, i.e. 1M *
160-striped items in a single directory? If it is a rare case,
then spending lots of time on it is not worthwhile.<br>
<br>
We need to confirm with ORNL what the final acceptance
test cases and environment are, including:<br>
a) stripe count<br>
b) item count<br>
c) network latency, with or without an lnet router (I suggest without a router)<br>
d) OST count on each OSS<br>
<br>
<br>
Cheers,<br>
--<br>
Nasf<o:p></o:p></p>
</div>
</div>
</blockquote>
<br>
</body>
</html>