[Lustre-discuss] [robinhood-support] robinhood error messages
LEIBOVICI Thomas
thomas.leibovici at cea.fr
Wed Nov 24 06:17:06 PST 2010
Thomas Roth wrote:
> Thank you Thomas.
> If these messages mean that robinhood just continues after the
> timeout, it would be nothing to worry about, but I will try to adapt
> the timeout anyhow.
> Right now, however, it seems the scan is really stuck: for days,
> rbh-report -i has been telling me about 612 TB in the filesystem, but
> lfs df says we have 787 TB ;-)
A couple of such messages would not be a big deal, but hundreds per day
over several days is not normal. I suspect a problem with timeout handling
in robinhood that leads to this kind of blocking. That's why I suggest
avoiding timeouts by increasing the timeout value.
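
For illustration, the two options described in the earlier message quoted
below could look as follows in the robinhood config file. This is only a
sketch: it assumes the usual block syntax with "#" comments and duration
suffixes such as "h" and "d". The parameter names are the ones from the
"FS_Scan" section mentioned in this thread, and the 30d value is an
arbitrary example, not a recommendation.

```
FS_Scan {
    # Option 1: raise the timeout far above the default of 1 hour so that
    # it is never reached. 30d is an illustrative value only (assumption).
    scan_op_timeout = 30d;

    # Option 2 (use instead of option 1): keep a moderate timeout and let
    # robinhood exit when the filesystem stops responding.
    # scan_op_timeout = 1h;
    # exit_on_timeout = TRUE;
}
```

With option 2, the daemon has to be started again by hand (or by whatever
supervises it) once the filesystem is healthy.
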
> Btw, whenever I restart the scan, e.g. after a reconfiguration such as
> for the timeout, I get the logfile full of
Tip: to change a scalar parameter like this one, you do not need to fully
restart the daemon. "service robinhood reload" or "kill -HUP" on the
process is enough.
> > ListMgr | DB query failed in ListMgr_Insert line 340...
> and assorted messages, which seem to indicate that the new robinhood
> scan tries to put something into the DB that is already there, and
> stumbles on this. Or maybe that happens when several robins are
> running simultaneously.
Are you running several instances scanning the same filesystem?
> I'm not sure if it is a problem for the scan. It is, however, a
> problem for the free space on /var, or wherever I point the log to ;-)
>
> Regards,
> Thomas
>
> On 24.11.2010 13:20, LEIBOVICI Thomas wrote:
>> Hi Thomas,
>>
>> We have already seen this, typically after the filesystem was blocked
>> for a while, or after an OSS had crashed.
>> If it is stuck for too long (default timeout is 1 hour), robinhood tries
>> to cancel its operation on current directory and continues with the next
>> one.
>> Maybe it didn't recover successfully from this cancellation, and you
>> have been receiving those messages ever since that went wrong.
>>
>> To avoid this problem, you can increase the timeout to a very high
>> value, to make sure it is never reached (e.g. xxx days).
>> In that case, robinhood will remain stuck as long as its current
>> operation in Lustre is blocked,
>> and it will resume the current operation as soon as Lustre is back.
>>
>> You can change this timeout by setting the "scan_op_timeout" parameter
>> in the "FS_Scan" section of the config file.
>>
>> Alternatively, you can keep a reasonable timeout and make robinhood
>> exit when the filesystem is not responding
>> by setting "exit_on_timeout = TRUE" in the same section of the config.
>> You can then respawn the robinhood daemon once everything is fixed.
>>
>> Best regards,
>> Thomas LEIBOVICI
>> CEA/DAM
>>
>> > A support request from lustre-discuss.
>> >
>> >
>> ------------------------------------------------------------------------
>> >
>> > Subject:
>> > [Lustre-discuss] robinhood error messages
>> > From:
>> > Thomas Roth <t.roth at gsi.de>
>> > Date:
>> > Tue, 23 Nov 2010 20:20:33 +0100
>> > To:
>> > lustre-discuss at lists.lustre.org
>> >
>> >
>> > Hi all,
>> >
>> > we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to
>> > find out where and who the big space consumers are - no purging).
>> >
>> > Robinhood sends me lots and lots of messages (~100/day) of the type
>> >
>> > > ===== FS scan is blocked (/lustre) =====
>> > > Date: 2010/11/23 20:05:22
>> > > Program: robinhood (pid 4826)
>> > > Host: lxb310
>> > > Filesystem: /lustre
>> > > A thread has been inactive for 3660 sec
>> > > while scanning directory /lustre/....
>> >
>> > This seems to indicate some trouble accessing certain directories on
>> > the node where robinhood is running. However, this is independent of
>> > the node, and at the same time we neither see any issues, slowness or
>> > connectivity problems nor get any user complaints to that effect.
>> >
>> > So I wonder whether anybody else is using robinhood and has seen
>> > similar messages.
>> >
>> > Regards,
>> > Thomas
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > Lustre-discuss at lists.lustre.org
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> >
>> > _______________________________________________
>> > robinhood-support mailing list
>> > robinhood-support at lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/robinhood-support
>> >
>>
>
>