[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems
Jessica Otey
jotey@nrao.edu
Fri May 19 07:39:14 PDT 2017
Hi Megan (et al.),
I don't understand the behavior, either... I've worked successfully with
changelogs in the past, and indeed it is very lightweight. (Since
robinhood has not been running anywhere, I'd already removed all the
changelog readers from the various MDTs for the reasons you noted.)
Whatever my problem is, it does not manifest as a load issue on either
the client or the MDT side; rather, it manifests as some sort of
connection failure. Here's the most recent example, which may generate
more ideas as to the cause.
On our third Lustre fs (the one we use for backups), I was able to
complete a file system scan to populate the database, but when I then
activated changelogs, the client almost immediately experienced the same
disconnections we've seen on the other two systems.
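For reference, here is a sketch of how the changelogs were turned on. The commands run on the MDS; the MDT name matches the logs below, the mask value is purely illustrative, and the `parse_reader_id` helper is my own scripting convenience, not part of lctl:

```shell
#!/bin/sh
# Enable changelogs by registering a reader on the MDT (run on the MDS).
# The mask value below is illustrative only:
#
#   lctl --device lard-MDT0000 changelog_register
#   lctl set_param mdd.lard-MDT0000.changelog_mask="+HSM"
#
# changelog_register prints the reader ID that robinhood must consume, e.g.:
#   lard-MDT0000: Registered changelog userid 'cl1'
#
# Helper (mine, not part of lctl) to pull that ID out for scripting:
parse_reader_id() {
    sed -n "s/.*userid '\(cl[0-9]\{1,\}\)'.*/\1/p"
}

# Demonstrate the extraction on a sample registration line:
echo "lard-MDT0000: Registered changelog userid 'cl1'" | parse_reader_id
```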
Here's the log from the MDT (heinlein, 10.7.17.126). The robinhood
client is akebono (10.7.17.122):
May 16 16:05:51 heinlein kernel: Lustre: lard-MDD0000: changelog on
May 16 16:05:51 heinlein kernel: Lustre: Modifying parameter general.mdd.lard-MDT*.changelog_mask in log params
May 16 16:13:16 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: Skipped 7458 previous similar messages
Here's what akebono (10.7.17.122) reported:
May 16 16:13:16 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:17 akebono kernel: LustreError: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:18 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:18 akebono kernel: LustreError: Skipped 14924 previous similar messages
Jessica
On 5/19/17 8:58 AM, Ms. Megan Larko wrote:
> Greetings Jessica,
>
> I'm not sure I am correctly understanding the behavior "robinhood
> activity floods the MDT". The robinhood program as you (and I) are
> using it is consuming the MDT CHANGELOG via a reader_id which was
> assigned when the CHANGELOG was enabled on the MDT. You can check
> the MDS for these readers via "lctl get_param mdd.*.changelog_users".
> Each CHANGELOG reader must either be consumed by a process or
> destroyed; otherwise, the CHANGELOG will grow until it consumes
> sufficient space to stop the MDT from functioning correctly. So
> robinhood should consume and then clear the CHANGELOG via this
> reader_id. This implementation of robinhood is actually a rather
> light-weight process as far as the MDS is concerned. The load issues
> I encountered were on the robinhood server itself which is a separate
> server from the Lustre MGS/MDS server.
>
> Just curious, have you checked for multiple reader_id's on your MDS
> for this Lustre file system?
>
> P.S. My robinhood configuration file is using nb_threads = 8, just for
> a data point.
>
> Cheers,
> megan
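Megan's suggested check translates to the following on the MDS. The sample output below is illustrative only (the indices are invented); it just shows how one might count the registered readers:

```shell
#!/bin/sh
# On the MDS, list registered changelog readers for every MDT:
#   lctl get_param mdd.*.changelog_users
#
# Illustrative output (values invented); each cl<N> row is one reader:
sample='mdd.lard-MDT0000.changelog_users=
current index: 42310
ID    index
cl1   42001'

# Count the readers. A reader that nothing consumes should be removed,
#   lctl --device lard-MDT0000 changelog_deregister cl1
# or the changelog grows until the MDT runs out of space.
printf '%s\n' "$sample" | grep -c '^cl[0-9]'
```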