[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems
Jessica Otey
jotey@nrao.edu
Fri May 19 07:39:14 PDT 2017
Hi Megan (et al.),
I don't understand the behavior, either... I've worked successfully with
changelogs in the past, and indeed it is very lightweight. (Since
robinhood has not been running anywhere, I'd already removed all the
changelog readers from the various MDTs for the reasons you noted.)
Whatever my problem is, it does not manifest as a load issue on either
the client or the MDT side; rather, it manifests as some sort of
connection failure. Here's the most recent example, which may generate
more ideas as to the cause.
On our third Lustre fs (the one we use for backups), I was able to
complete a file system scan to populate the database, but when I then
activated changelogs, the client almost immediately experienced the same
disconnections we've seen on the other two systems.
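For reference, here is a sketch of how the changelogs were turned on. The commands run on the MDS; the MDT name matches the logs below, the mask value is purely illustrative, and the `parse_reader_id` helper is my own scripting convenience, not part of lctl:

```shell
#!/bin/sh
# Enable changelogs by registering a reader on the MDT (run on the MDS).
# The mask value below is illustrative only:
#
#   lctl --device lard-MDT0000 changelog_register
#   lctl set_param mdd.lard-MDT0000.changelog_mask="+HSM"
#
# changelog_register prints the reader ID that robinhood must consume, e.g.:
#   lard-MDT0000: Registered changelog userid 'cl1'
#
# Helper (mine, not part of lctl) to pull that ID out for scripting:
parse_reader_id() {
    sed -n "s/.*userid '\(cl[0-9]\{1,\}\)'.*/\1/p"
}

# Demonstrate the extraction on a sample registration line:
echo "lard-MDT0000: Registered changelog userid 'cl1'" | parse_reader_id
```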
Here's the log from the MDT (heinlein, 10.7.17.126). The robinhood
client is akebono (10.7.17.122):
May 16 16:05:51 heinlein kernel: Lustre: lard-MDD0000: changelog on
May 16 16:05:51 heinlein kernel: Lustre: Modifying parameter general.mdd.lard-MDT*.changelog_mask in log params
May 16 16:13:16 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: lard-MDT0000: Client 2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: Skipped 7458 previous similar messages
Here's what akebono (10.7.17.122) reported:
May 16 16:13:16 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:16 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:17 akebono kernel: LustreError: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection to lard-MDT0000 (at 10.7.17.126@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:17 akebono kernel: Lustre: lard-MDT0000-mdc-ffff880fd68d7000: Connection restored to lard-MDT0000 (at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar messages
May 16 16:13:18 akebono kernel: LustreError: 11-0: lard-MDT0000-mdc-ffff880fd68d7000: Communicating with 10.7.17.126@o2ib, operation llog_origin_handle_destroy failed with -19.
May 16 16:13:18 akebono kernel: LustreError: Skipped 14924 previous similar messages
Jessica
On 5/19/17 8:58 AM, Ms. Megan Larko wrote:
> Greetings Jessica,
>
> I'm not sure I am correctly understanding the behavior "robinhood
> activity floods the MDT". The robinhood program as you (and I) are
> using it is consuming the MDT CHANGELOG via a reader_id which was
> assigned when the CHANGELOG was enabled on the MDT. You can check
> the MDS for these readers via "lctl get_param mdd.*.changelog_users".
> Each CHANGELOG reader must either be consumed by a process or
> destroyed; otherwise, the CHANGELOG will grow until it consumes
> sufficient space to stop the MDT from functioning correctly. So
> robinhood should consume and then clear the CHANGELOG via this
> reader_id. This implementation of robinhood is actually a rather
> light-weight process as far as the MDS is concerned. The load issues
> I encountered were on the robinhood server itself which is a separate
> server from the Lustre MGS/MDS server.
>
> Just curious, have you checked for multiple reader_id's on your MDS
> for this Lustre file system?
>
> P.S. My robinhood configuration file is using nb_threads = 8, just for
> a data point.
>
> Cheers,
> megan
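Megan's suggested check translates to the following on the MDS. The sample output below is illustrative only (the indices are invented); it just shows how one might count the registered readers:

```shell
#!/bin/sh
# On the MDS, list registered changelog readers for every MDT:
#   lctl get_param mdd.*.changelog_users
#
# Illustrative output (values invented); each cl<N> row is one reader:
sample='mdd.lard-MDT0000.changelog_users=
current index: 42310
ID    index
cl1   42001'

# Count the readers. A reader that nothing consumes should be removed,
#   lctl --device lard-MDT0000 changelog_deregister cl1
# or the changelog grows until the MDT runs out of space.
printf '%s\n' "$sample" | grep -c '^cl[0-9]'
```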