<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Hi Megan (et al.),<br>
</p>
<p>I don't understand the behavior either... I've worked
successfully with changelogs in the past, and indeed the mechanism
is very lightweight. (Since robinhood has not been running
anywhere, I'd already removed all the changelog readers from the
various MDTs, for the reasons you noted.)</p>
<p>Whatever my problem is, it does not manifest as a load issue on
either the client or the MDT side. Rather, it manifests as some
sort of connection failure. Here's the most recent example, which
may generate more ideas as to the cause.<br>
</p>
<p>On our third Lustre fs (one we use for backups), I was able to
complete a file system scan to populate the database, but when I
then activated changelogs, the client almost immediately
experienced the same disconnections we've seen on the other two
systems.</p>
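<p>For reference, the enable/inspect/teardown sequence I'm using is
roughly the following, run on the MDS. The device name
<tt>lard-MDT0000</tt>, the reader id <tt>cl1</tt>, and the mask values
are just illustrative here, not my exact settings:</p>
<pre># Register a changelog consumer; prints a reader id such as "cl1"
lctl --device lard-MDT0000 changelog_register

# Choose which record types get logged
lctl set_param mdd.lard-MDT0000.changelog_mask="MARK CREAT UNLNK RENME"

# List registered readers and the last record each has cleared
lctl get_param mdd.*.changelog_users

# Deregister a stale reader so its records can be purged
lctl --device lard-MDT0000 changelog_deregister cl1</pre>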
<p>Here's the log from the MDT (heinlein, <tt>10.7.17.126</tt>).
The robinhood client is akebono (<tt>10.7.17.122</tt>):</p>
<pre>May 16 16:05:51 heinlein kernel: Lustre: lard-MDD0000: changelog on
May 16 16:05:51 heinlein kernel: Lustre: Modifying parameter general.mdd.lard-MDT*.changelog_mask in log params
May 16 16:13:16 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: Skipped 7458 previous similar messages</pre>
<p></p>
<p>Here's what akebono (<tt>10.7.17.122</tt>) reported:</p>
<pre>May 16 16:13:16 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:17 akebono kernel: LustreError: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:18 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:18 akebono kernel: LustreError: Skipped 14924 previous similar messages</pre>
<p></p>
<p>Jessica<br>
</p>
<div class="moz-cite-prefix">On 5/19/17 8:58 AM, Ms. Megan Larko
wrote:<br>
</div>
<blockquote
cite="mid:CAPAniMbcr6bV-LH0Yu2ZTZwa3n9Hkb7_uF7cRHc=sVE2nY9UmQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Greetings Jessica,</div>
<div><br>
</div>
<div>I'm not sure I am correctly understanding the behavior
"robinhood activity floods the MDT". The robinhood program
as you (and I) are using it is consuming the MDT CHANGELOG via
a reader_id which was assigned when the CHANGELOG was enabled
on the MDT. You can check the MDS for these readers via
"lctl get_param mdd.*.changelog_users". Each CHANGELOG reader
must either be consumed by a process or destroyed; otherwise
the CHANGELOG will grow until it consumes enough space to
stop the MDT from functioning correctly. So robinhood should
consume and then clear the CHANGELOG via this reader_id. This
implementation of robinhood is actually a rather lightweight
process as far as the MDS is concerned. The load issues I
encountered were on the robinhood server itself which is a
separate server from the Lustre MGS/MDS server.</div>
<div><br>
</div>
<div>Just curious, have you checked for multiple reader_id's on
your MDS for this Lustre file system?</div>
<div><br>
</div>
<div>P.S. My robinhood configuration file is using nb_threads =
8, just for a data point.</div>
<div><br>
</div>
<div>Cheers,</div>
<div>megan</div>
</div>
</blockquote>
<br>
</body>
</html>