<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Hi Eric,<br>

    <br>

    Thanks very much for your comparison of the results. I want to give

    more explanation for the results:<br>

    <br>

    1) I suspect that the complex CLIO lock state machine and its tedious
    iteration/repeat operations hurt the performance of traversing a
    large-striped directory; that is, the overhead introduced by those
    factors is higher than in the original b1_8 I/O stack. For measuring
    per-stripe overhead, it is unfair to compare patched lustre-2.x
    with lustre-1.8, because my AGL-related patches are asynchronous,
    pipelined operations that hide much of this overhead, whereas b1_8
    uses synchronous glimpse with no prefetching. If you compare original
    lustre-2.x with lustre-1.8, you will see the overhead difference.
    In fact, this difference can also be seen in your second graph.
    Just as you said: "<span style="color:
      rgb(31, 73, 125);">1.8 gets better with more stripes, patched 2.x
      gets worse"</span>.<br>

    <br>

    2) Currently, the limit on the number of AGL RPCs is the statahead
    window. Originally, this window was used only for controlling MDS-side
    statahead. This means that as soon as an item's MDS-side attributes
    are ready (prefetched), the related OSS-side AGL RPC can be triggered.
    The default statahead window size is 32, and my tests used the
    default value. I also tested with a larger window size on Toro, but it
    did not help much. I am not sure whether it would do better when
    tested against more powerful nodes/network.<br>

    <br>

    3) For large-striped directories, the test results may not represent
    real cases, because in my test there are 8 OSTs on each OSS, but
    each OSS has only a 4-core CPU, which is much slower than the client
    node (24-core CPU). I found that the OSS load was quite high for the
    32-striped cases. In theory, there are at most 32 * 8 concurrent AGL
    RPCs for each OSS. If we can test large-striped directories on more
    powerful OSS nodes, the improvement may be better than the current
    results.<br>

    <br>

    4) If the OSS is the performance bottleneck, that can also explain, to
    some degree, why "<span
      style="color: rgb(31, 73, 125);">1.8 gets better with more
      stripes, patched 2.x gets worse"</span>. For b1_8, the glimpse RPCs
    for two consecutive items are synchronous, so there are at most 8
    concurrent glimpse RPCs for each OSS, which means less contention
    and therefore less overhead caused by that contention. This is just
    a guess based on my experience studying SMP scaling.<br>
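    <br>
    To make the arithmetic in points 3) and 4) concrete, here is a rough
    sketch. It is an illustration only, not code from the patches: the
    helper names are made up, and it assumes the stripes of each file are
    spread evenly over all OSTs.<br>
<pre>
```python
# Upper bound on concurrent glimpse RPCs per OSS, using the numbers from
# points 2) to 4). All names here are illustrative, not Lustre identifiers.

def stripes_per_oss(stripe_count, oss_count, osts_per_oss):
    """Stripes of one file that land on the busiest OSS, assuming stripes
    are spread evenly over all OSTs (an assumption, not a measured layout)."""
    total_osts = oss_count * osts_per_oss
    used_osts = min(stripe_count, total_osts)
    # Ceiling division: the busiest OSS holds at most this many stripes.
    return min(osts_per_oss, -(-used_osts // oss_count))


def max_rpcs_per_oss(files_in_flight, stripe_count, oss_count=4, osts_per_oss=8):
    """Worst case: every file in flight has a glimpse RPC outstanding on
    each of its stripes that lives on the same OSS."""
    return files_in_flight * stripes_per_oss(stripe_count, oss_count, osts_per_oss)


STATAHEAD_WINDOW = 32  # default statahead window size from point 2)

agl_bound = max_rpcs_per_oss(STATAHEAD_WINDOW, 32)  # patched 2.x, async AGL
sync_bound = max_rpcs_per_oss(1, 32)                # b1_8, one item at a time
print(agl_bound, sync_bound)
```
</pre>
    With the default window of 32 and 32-striped files, this gives an upper
    bound of 256 RPCs per OSS for patched 2.x versus 8 for b1_8, which is
    the gap in OSS-side concurrency described above.<br>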

    <br>

    <br>

    Cheers,<br>

    --<br>

    Nasf<br>

    <br>

    On 5/26/11 9:01 PM, Eric Barton wrote:

    <blockquote cite="mid:012401cc1ba4$fc090da0$f41b28e0$@com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=ISO-8859-1">

      <meta name="Generator" content="Microsoft Word 12 (filtered

        medium)">

      <!--[if !mso]><style>v\:* {behavior:url(#default#VML);}

o\:* {behavior:url(#default#VML);}

w\:* {behavior:url(#default#VML);}

.shape {behavior:url(#default#VML);}

</style><![endif]-->

      <style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";

        color:black;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoAcetate, li.MsoAcetate, div.MsoAcetate

        {mso-style-priority:99;

        mso-style-link:"Balloon Text Char";

        margin:0cm;

        margin-bottom:.0001pt;

        font-size:8.0pt;

        font-family:"Tahoma","sans-serif";

        color:black;}

span.EmailStyle17

        {mso-style-type:personal-reply;

        font-family:"Times New Roman","serif";

        color:#1F497D;}

span.BalloonTextChar

        {mso-style-name:"Balloon Text Char";

        mso-style-priority:99;

        mso-style-link:"Balloon Text";

        font-family:"Tahoma","sans-serif";

        color:black;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 72.0pt 72.0pt 72.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="2050" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

      <div class="WordSection1">

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Nasf,<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Interesting

            results.  Thank you - especially for graphing the results so

            thoroughly.<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">I’m

            attaching them here and cc-ing lustre-devel since these are

            of general interest.<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">I

            don’t think your conclusion number (1), to say CLIO locking

            is slowing us down<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">is

            as obvious from these results as you imply.  If you just

            compare the 1.8 and<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">patched

            2.x per-file times and how they scale with #stripes you get

            this…<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><img

              id="Chart_x0020_3"

              src="cid:part1.07070302.03060502@whamcloud.com"

              height="461" width="668"></span><span style="color:

            rgb(31, 73, 125);"><o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">The

            gradients of these lines should correspond to the additional

            time per stripe required<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">to

            stat each file and I’ve graphed these times below (ignoring

            the 0-stripe data for this<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">calculation

            because I’m just interested in the incremental per-stripe

            overhead).<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);"><img

              id="Chart_x0020_5"

              src="cid:part2.08010704.06050106@whamcloud.com"

              height="371" width="668"></span><span style="color:

            rgb(31, 73, 125);"><o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">They

            show per-stripe overhead for 1.8 well above patched 2.x for

            the lower stripe<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">counts,

            but whereas 1.8 gets better with more stripes, patched 2.x

            gets worse.  I’m<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">guessing

            that at high stripe counts, 1.8 puts many concurrent

            glimpses on the wire<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">and

            does it quite efficiently.  I’d like to understand better

            how you control the #<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">of

            glimpse-aheads you keep on the wire – is it a single fixed

            number, or a fixed<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">number

            per OST or some other scheme?  In any case, it will be

            interesting to see<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">measurements

            at higher stripe counts.<o:p></o:p></span></p>

        <blockquote style="margin-top: 5pt; margin-bottom: 5pt;">

          <p class="MsoNormal" style=""><span style="color: rgb(31, 73,

              125);" lang="EN-US">Cheers, <br>

                                 Eric <o:p></o:p></span></p>

        </blockquote>

        <div style="border-width: medium medium medium 1.5pt;

          border-style: none none none solid; border-color:

          -moz-use-text-color -moz-use-text-color -moz-use-text-color

          blue; padding: 0cm 0cm 0cm 4pt;">

          <div>

            <div style="border-right: medium none; border-width: 1pt

              medium medium; border-style: solid none none;

              border-color: rgb(181, 196, 223) -moz-use-text-color

              -moz-use-text-color; padding: 3pt 0cm 0cm;">

              <p class="MsoNormal"><b><span style="font-size: 10pt;
                    font-family: 'Tahoma','sans-serif'; color:
                    windowtext;" lang="EN-US">From:</span></b><span
                  style="font-size: 10pt; font-family:
                  'Tahoma','sans-serif'; color:
                  windowtext;" lang="EN-US"> Fan Yong

                  [<a class="moz-txt-link-freetext" href="mailto:yong.fan@whamcloud.com">mailto:yong.fan@whamcloud.com</a>] <br>

                  <b>Sent:</b> 12 May 2011 10:18 AM<br>

                  <b>To:</b> Eric Barton<br>

                  <b>Cc:</b> Bryon Neitzel; Ian Colle; Liang Zhen<br>

                  <b>Subject:</b> New test results for "ls -Ul"<o:p></o:p></span></p>

            </div>

          </div>

          <p class="MsoNormal"><o:p> </o:p></p>

          <p class="MsoNormal" style="margin-bottom: 12pt;">I have
            improved the statahead load-balancing mechanism to distribute
            the statahead load across more CPU units on the client, and
            adjusted AGL to fit the CLIO lock state machine. With these
            improvements, 'ls -Ul' runs faster than with the old patches,
            especially on a large SMP node.<br>
            <br>
            On the other hand, as the degree of parallelism increases,
            the lower-layer network scheduler becomes the performance
            bottleneck. So I combined my patches with Liang's SMP
            patches in the test.<o:p></o:p></p>

          <table class="MsoNormalTable" style="width: 100%;" border="1"

            cellpadding="0" width="100%">

            <tbody>

              <tr>

                <td style="padding: 1.5pt;" valign="top"><br>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">client (fat-intel-4, 24 cores)<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">server (client-xxx, 4 OSSes, 8

                    OSTs on each OSS)<o:p></o:p></p>

                </td>

              </tr>

              <tr>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">b2x_patched<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">my patches + SMP patches<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">my patches<o:p></o:p></p>

                </td>

              </tr>

              <tr>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">b18<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">original b1_8<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">share the same server with

                    "b2x_patched"<o:p></o:p></p>

                </td>

              </tr>

              <tr>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">b2x_original<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">original b2_x<o:p></o:p></p>

                </td>

                <td style="padding: 1.5pt;" valign="top">

                  <p class="MsoNormal">original b2_x<o:p></o:p></p>

                </td>

              </tr>

            </tbody>

          </table>

          <p class="MsoNormal"><br>

            Some notes:<br>

            <br>

            1) Stripe count strongly affects traversal performance, and
            the impact is more than linear. Even with all the patches
            applied on b2_x, the impact of stripe count is still larger
            than on b1_8. This is related to the complex CLIO lock
            state machine and tedious iteration/repeat operations. It is
            not easy to make it run as efficiently as b1_8.<br>

            <br>

            2) Patched b2_x is much faster than original b2_x: for
            traversing a directory of 400K * 32-striped items, it is
            100 times faster or more.<br>

            <br>

            3) Patched b2_x is also faster than b1_8. In our tests,
            patched b2_x is at least 4X faster than b1_8, which meets
            the requirement in the ORNL contract.<br>

            <br>

            4) Original b2_x is faster than b1_8 only for small-striped
            cases (no more than 4 stripes). For large-striped cases it
            is slower than b1_8, which is consistent with the ORNL test
            results.<br>

            <br>

            5) The largest stripe count in our tests is 32. We do not
            have enough resources to test larger stripe counts. I also
            wonder whether testing larger-striped directories is
            worthwhile: how many customers want to use a large, fully
            striped directory, that is, one containing 1M * 160-striped
            items in a single directory? If it is a rare case, then
            spending lots of time on it is not worthwhile.<br>

            <br>

            We need to confirm with ORNL what the final acceptance
            test cases and environment are, including:<br>
            a) stripe count<br>
            b) item count<br>
            c) network latency, with or without an lnet router (I suggest without a router)<br>

            d) OST count on each OSS<br>

            <br>

            <br>

            Cheers,<br>

            --<br>

            Nasf<o:p></o:p></p>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>