<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}

o\:* {behavior:url(#default#VML);}

w\:* {behavior:url(#default#VML);}

.shape {behavior:url(#default#VML);}

</style><![endif]--><style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";

        color:black;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoAcetate, li.MsoAcetate, div.MsoAcetate

        {mso-style-priority:99;

        mso-style-link:"Balloon Text Char";

        margin:0cm;

        margin-bottom:.0001pt;

        font-size:8.0pt;

        font-family:"Tahoma","sans-serif";

        color:black;}

span.BalloonTextChar

        {mso-style-name:"Balloon Text Char";

        mso-style-priority:99;

        mso-style-link:"Balloon Text";

        font-family:"Tahoma","sans-serif";

        color:black;}

span.EmailStyle19

        {mso-style-type:personal;

        font-family:"Times New Roman","serif";

        color:#1F497D;}

span.EmailStyle20

        {mso-style-type:personal-reply;

        font-family:"Times New Roman","serif";

        color:#993366;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 72.0pt 72.0pt 72.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></head><body bgcolor=white lang=EN-GB link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='color:#993366'>Nasf,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#993366'>I agree that we have to be careful comparing 1.8 and patched 2.x since 1.8 is doing<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>no RPC pipelining to the MDS or OSSs – however I still think (unless you can show<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>me the hole in my reasoning) that comparing the slopes of the time v. # stripes graphs<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>is fair.  These slopes correspond to the additional time it takes to stat a file with more<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>stripes.  Although total per-file stat times in 1.8 are dominated by RPC round-trips<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>to the MDS and OSSes – the OSS RPCs are all sent concurrently, so the incremental<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>time per stripe should be the time it takes to traverse the stack for each stripe and<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>issue the RPC.  
Similarly for 2.x, the incremental time per stripe should also be the<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>time it takes to traverse the stack for each stripe and queue the async glimpse.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#993366'>In any case, I think measurements of higher stripe counts on a larger server cluster<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'>will be revealing.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#993366'><o:p> </o:p></span></p><blockquote style='margin-top:5.0pt;margin-bottom:5.0pt'><p class=MsoNormal><span style='color:#993366'>Cheers,<br>                   Eric <o:p></o:p></span></p></blockquote><p class=MsoNormal><span style='color:#993366'><o:p> </o:p></span></p><div style='border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt'><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'>From:</span></b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'> Fan Yong [mailto:yong.fan@whamcloud.com] <br><b>Sent:</b> 26 May 2011 3:36 PM<br><b>To:</b> Eric Barton<br><b>Cc:</b> 'Bryon Neitzel'; 'Ian Colle'; 'Liang Zhen'; lustre-devel@lists.lustre.org<br><b>Subject:</b> Re: New test results for "ls -Ul"<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Hi Eric,<br><br>Thanks very much for your comparison of the results. I would like to explain the results further:<br><br>1) I suspect the complex CLIO lock state machine and its tedious iteration/repeat operations hurt the performance of traversing large-striped directories; that is, the overhead introduced by those factors is higher than in the original b1_8 I/O stack. 
To measure per-stripe overhead, it is unfair to compare patched lustre-2.x against lustre-1.8, because my AGL-related patches are asynchronous, pipelined operations that hide much of this overhead, whereas b1_8 uses synchronous glimpses with no pre-fetching. If you compare original lustre-2.x with lustre-1.8, you will see the overhead difference. In fact, the difference is also visible in your second graph. As you said: "<span style='color:#1F497D'>1.8 gets better with more stripes, patched 2.x gets worse"</span>.<br><br>2) Currently, the number of AGL RPCs in flight is limited by the statahead window. Originally, this window was used only to control MDS-side statahead. This means that as soon as an item's MDS-side attributes are ready (pre-fetched), the related OSS-side AGL RPC can be triggered. The default statahead window size is 32, and I used the default value in my tests. I also tested with a larger window size on Toro, but it did not help much. I am not sure whether it would do better when tested against more powerful nodes/network.<br><br>3) For large-striped directories, the test results may not represent real cases: in my test there are 8 OSTs on each OSS, but each OSS has only a 4-core CPU, which is much slower than the client node (24-core CPU). I found the OSS load was quite high in the 32-striped cases. In theory, there are at most 32 * 8 concurrent AGL RPCs per OSS. If we can test large-striped directories on more powerful OSS nodes, the improvement may be better than the current results.<br><br>4) If the OSS is the performance bottleneck, that can also explain, to some degree, why "<span style='color:#1F497D'>1.8 gets better with more stripes, patched 2.x gets worse"</span>. For b1_8, the glimpse RPCs for successive items are synchronous, so there are at most 8 concurrent glimpse RPCs per OSS, which means less contention and therefore less overhead caused by contention. 
This is only a guess, based on my experience studying SMP scaling.<br><br><br>Cheers,<br>--<br>Nasf<br><br>On 5/26/11 9:01 PM, Eric Barton wrote: <o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>Nasf,</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'> </span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>Interesting results.  Thank you - especially for graphing the results so thoroughly.</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>I’m attaching them here and cc-ing lustre-devel since these are of general interest.</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'> </span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>I don’t think your conclusion number (1), to say CLIO locking is slowing us down</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>is as obvious from these results as you imply.  If you just compare the 1.8 and</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>patched 2.x per-file times and how they scale with #stripes you get this…</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'> </span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'><img width=668 height=461 id="Chart_x0020_3" src="cid:image001.png@01CC1BD3.48E26DF0"></span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'> </span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>The gradients of these lines should correspond to the additional time per stripe required</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>to stat each file and I’ve graphed these times below (ignoring the 0-stripe data for this</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>calculation because I’m just interested in the incremental per-stripe overhead).</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'> </span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'><img 
width=668 height=371 id="Chart_x0020_5" src="cid:image002.png@01CC1BD3.48E26DF0"></span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I’m</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>and does it quite efficiently.  I’d like to understand better how you control the #</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>of glimpse-aheads you keep on the wire – is it a single fixed number, or a fixed</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>number per OST or some other scheme?  In any case, it will be interesting to see</span><o:p></o:p></p><p class=MsoNormal><span style='color:#1F497D'>measurements at higher stripe counts.</span><o:p></o:p></p><blockquote style='margin-top:5.0pt;margin-bottom:5.0pt'><p class=MsoNormal><span lang=EN-US>Cheers, <br>                   Eric </span><o:p></o:p></p></blockquote><div style='border:none;border-left:solid windowtext 1.5pt;padding:0cm 0cm 0cm 4.0pt;border-color:-moz-use-text-color -moz-use-text-color -moz-use-text-color
          blue'><div><div style='border:none;border-top:solid windowtext 1.0pt;padding:3.0pt 0cm 0cm 0cm;border-color:-moz-use-text-color
              -moz-use-text-color'><p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'>From:</span></b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'> Fan Yong [<a href="mailto:yong.fan@whamcloud.com">mailto:yong.fan@whamcloud.com</a>] <br><b>Sent:</b> 12 May 2011 10:18 AM<br><b>To:</b> Eric Barton<br><b>Cc:</b> Bryon Neitzel; Ian Colle; Liang Zhen<br><b>Subject:</b> New test results for "ls -Ul"</span><o:p></o:p></p></div></div><p class=MsoNormal> <o:p></o:p></p><p class=MsoNormal style='margin-bottom:12.0pt'>I have improved the statahead load-balancing mechanism to distribute the statahead load across more CPUs on the client, and adjusted AGL to fit the CLIO lock state machine. With these improvements, 'ls -Ul' runs faster than with the old patches, especially on large SMP nodes.<br><br>On the other hand, as the degree of parallelism increases, the lower-level network scheduler becomes the performance bottleneck. 
So I combined my patches with Liang's SMP patches in the test.<o:p></o:p></p><table class=MsoNormalTable border=1 cellpadding=0 width="100%" style='width:100.0%'><tr><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>client (fat-intel-4, 24 cores)<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>server (client-xxx, 4 OSSes, 8 OSTs on each OSS)<o:p></o:p></p></td></tr><tr><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>b2x_patched<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>my patches + SMP patches<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>my patches<o:p></o:p></p></td></tr><tr><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>b18<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>original b1_8<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>share the same server with "b2x_patched"<o:p></o:p></p></td></tr><tr><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>b2x_original<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>original b2_x<o:p></o:p></p></td><td valign=top style='padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>original b2_x<o:p></o:p></p></td></tr></table><p class=MsoNormal><br>Some notes:<br><br>1) Stripe count strongly affects traversal performance, and the impact is more than linear. Even with all the patches applied to b2_x, the impact of stripe count is still larger than on b1_8. This is related to the complex CLIO lock state machine and tedious iteration/repeat operations. 
It is not easy to make it run as efficiently as b1_8.<br><br>2) Patched b2_x is much faster than original b2_x: for traversing a 400K * 32-striped directory, the improvement is 100 times or more.<br><br>3) Patched b2_x is also faster than b1_8: in our tests, patched b2_x is at least 4X faster than b1_8, which meets the requirement in the ORNL contract.<br><br>4) Original b2_x is faster than b1_8 only for small stripe counts, no more than 4 stripes. For large stripe counts it is slower than b1_8, which is consistent with the ORNL test results.<br><br>5) The largest stripe count in our test is 32. We do not have enough resources to test larger stripe counts. I also wonder whether it is worth testing larger striped directories at all: how many customers want to use large, fully striped directories, that is, 1M * 160-striped items in a single directory? If that is a rare case, spending a lot of time on it is not worthwhile.<br><br>We need to confirm with ORNL what the final acceptance test cases and environment are, including:<br>a) stripe count<br>b) item count<br>c) network latency, with or without an LNET router (I suggest without a router)<br>d) OST count on each OSS<br><br><br>Cheers,<br>--<br>Nasf<o:p></o:p></p></div><p class=MsoNormal><o:p> </o:p></p></div></div></body></html>