<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p>Hi Megan (et al.),<br>

    </p>

    <p>I don't understand the behavior, either... I've worked

      successfully with changelogs in the past, and indeed they are very

      lightweight. (Since robinhood has not been running anywhere, I'd

      already removed all the changelog readers from the various MDTs

      for the reasons you noted.)</p>

    <p>Whatever my problem is, it does not manifest as a load issue on

      either the client or the MDT side. Rather, it manifests as some sort

      of connection failure. Here's the most recent example, which may

      generate more ideas as to the cause.<br>

    </p>

    <p>On our third Lustre file system (one we use for backups), I was able to

      complete a file system scan to populate the database, but then

      when I activated changelogs, the client almost immediately

      experienced the disconnections we've seen on the other two

      systems.</p>

    <p>Here's the log from the MDT (heinlein, <tt>10.7.17.126</tt>).

      The robinhood client is akebono (<tt>10.7.17.122</tt>):<tt><br>

      </tt><tt>May 16 16:05:51 heinlein kernel: Lustre: lard-MDD0000:

        changelog on</tt><tt><br>

      </tt><tt>May 16 16:05:51 heinlein kernel: Lustre: Modifying

        parameter general.mdd.lard-MDT*.changelog_mask in log params</tt><tt><br>

      </tt><tt>May 16 16:13:16 heinlein kernel: Lustre: lard-MDT0000:

        Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at

        10.7.17.122@o2ib) reconnecting</tt><tt><br>

      </tt><tt>May 16 16:13:17 heinlein kernel: Lustre: lard-MDT0000:

        Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at

        10.7.17.122@o2ib) reconnecting</tt><tt><br>

      </tt><tt>May 16 16:13:17 heinlein kernel: Lustre: Skipped 7458

        previous similar messages</tt><tt><br>

      </tt><br>

    </p>

    <p>Here's what akebono (10.7.17.122) reported: <br>

      <br>

      <tt>May 16 16:13:16 akebono kernel: LustreError: 11-0:

        lard-MDT0000-mdc-ffff880fd68d7000: Communicating with

        10.7.17.126@o2ib, operation llog_origin_handle_destroy failed

        with -19.</tt><tt><br>

      </tt><tt>May 16 16:13:16 akebono kernel: Lustre:

        lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000

        (at 10.7.17.126@o2ib) was lost; in progress operations using

        this service will wait for recovery to complete</tt><tt><br>

      </tt><tt>May 16 16:13:16 akebono kernel: Lustre:

        lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to

        lard-MDT0000 (at 10.7.17.126@o2ib)</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: LustreError: 11-0:

        lard-MDT0000-mdc-ffff880fd68d7000: Communicating with

        10.7.17.126@o2ib, operation llog_origin_handle_destroy failed

        with -19.</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: LustreError: Skipped 7458

        previous similar messages</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: Lustre:

        lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000

        (at 10.7.17.126@o2ib) was lost; in progress operations using

        this service will wait for recovery to complete</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: Lustre: Skipped 7458

        previous similar messages</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: Lustre:

        lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to

        lard-MDT0000 (at 10.7.17.126@o2ib)</tt><tt><br>

      </tt><tt>May 16 16:13:17 akebono kernel: Lustre: Skipped 7458

        previous similar messages</tt><tt><br>

      </tt><tt>May 16 16:13:18 akebono kernel: LustreError: 11-0:

        lard-MDT0000-mdc-ffff880fd68d7000: Communicating with

        10.7.17.126@o2ib, operation llog_origin_handle_destroy failed

        with -19.</tt><tt><br>

      </tt><tt>May 16 16:13:18 akebono kernel: LustreError: Skipped

        14924 previous similar messages</tt></p>

    <p>Jessica<br>

    </p>

    <div class="moz-cite-prefix">On 5/19/17 8:58 AM, Ms. Megan Larko

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAPAniMbcr6bV-LH0Yu2ZTZwa3n9Hkb7_uF7cRHc=sVE2nY9UmQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>Greetings Jessica,</div>

        <div><br>

        </div>

        <div>I'm not sure I am correctly understanding the behavior

          "robinhood activity floods the MDT".   The robinhood program

          as you (and I) are using it is consuming the MDT CHANGELOG via

          a reader_id which was assigned when the CHANGELOG was enabled

          on the MDT.   You can check the MDS for these readers via

          "lctl get_param mdd.*.changelog_users".  Each CHANGELOG reader

          must either be consumed by a process or destroyed; otherwise,

          the CHANGELOG will grow until it consumes enough space to

          stop the MDT from functioning correctly.  So robinhood should

          consume and then clear the CHANGELOG via this reader_id.  This

          implementation of robinhood is actually a rather light-weight

          process as far as the MDS is concerned.   The load issues I

          encountered were on the robinhood server itself, which is a

          separate server from the Lustre MGS/MDS server.</div>

        <div><br>

        </div>

        <div>Just curious, have you checked for multiple reader_id's on

          your MDS for this Lustre file system?</div>

        <div><br>

        </div>

        <div>P.S. My robinhood configuration file is using nb_threads =

          8, just for a data point.</div>

        <div><br>

        </div>

        <div>Cheers,</div>

        <div>megan</div>

      </div>


    </blockquote>

    <br>

  </body>

</html>