[Lustre-discuss] [robinhood-support] robinhood error messages

LEIBOVICI Thomas thomas.leibovici at cea.fr
Wed Nov 24 06:17:06 PST 2010


Thomas Roth wrote:
> Thank you Thomas.
> If these messages mean that robinhood just continues after the 
> timeout, it would be nothing to worry about, but I will try to adapt 
> the timeout anyhow.
> Right now, however, it seems the scan is really stuck: for days, 
> rbh-report -i has been reporting 612 TB in the filesystem, but lfs df 
> says we have 787 TB ;-)
A couple of such messages would not be a big deal, but hundreds per day 
over several days is not normal. I suspect a problem in robinhood's 
timeout handling that leads to this blocking. That's why I suggest you 
avoid timeouts altogether by increasing the timeout value.
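
For reference, a minimal sketch of what the relevant block might look 
like, using the "FS_Scan" section and the two parameters named in my 
earlier message quoted below. The block syntax, the "#" comments and the 
"1h" duration notation are assumptions; check them against the template 
config shipped with your version. The value shown is only the default 
mentioned in this thread, not a recommendation.

```
FS_Scan
{
    # Default is 1 hour; raise this far enough that it is never reached.
    scan_op_timeout = 1h;

    # Alternative: keep a reasonable timeout and let the daemon exit
    # when the filesystem stops responding, then respawn it later.
    # exit_on_timeout = TRUE;
}
```

The two settings are meant as alternatives: either a very high timeout 
so robinhood simply waits for Lustre, or a moderate timeout combined with 
exit_on_timeout.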
> Btw, whenever I restart the scan, e.g. after a reconfiguration such as 
> for the timeout, I get the logfile full of
Tip: to change a scalar parameter like this one, you do not need to fully 
restart the daemon. "service robinhood reload" or "kill -HUP" on the 
process is enough.
> > ListMgr | DB query failed in ListMgr_Insert line 340...
> and assorted messages, which seem to indicate that the new robinhood 
> scan tries to put something into the DB that is already there, and 
> stumbles on this. Or maybe that happens when several robins are 
> running simultaneously.
Are you running several instances scanning the same filesystem?
> I'm not sure if it is a problem for the scan, it is, however, a 
> problem for the free space on /var, or wherever I point the log to ;-)
>
> Regards,
> Thomas
>
> On 24.11.2010 13:20, LEIBOVICI Thomas wrote:
>> Hi Thomas,
>>
>> We have already seen this, typically after the filesystem had been
>> blocked for a while, or after an OSS had crashed.
>> If it is stuck for too long (default timeout is 1 hour), robinhood tries
>> to cancel its operation on the current directory and continues with the
>> next one.
>> Maybe it didn't recover successfully from this cancellation, and you
>> have been receiving those messages ever since.
>>
>> To avoid this problem, you can increase the timeout to a very high
>> value, to make sure it is never reached (e.g. xxx days).
>> In that case, robinhood will remain stuck as long as its current
>> operation in Lustre is blocked,
>> and it will resume the current operation as soon as Lustre is back.
>>
>> You can change this timeout by setting the "scan_op_timeout" parameter
>> in the "FS_Scan" section of the config file.
>>
>> Alternatively, you can keep a reasonable timeout and make robinhood
>> exit when the filesystem is not responding
>> by setting "exit_on_timeout = TRUE" in the same section of the config.
>> You can then respawn the robinhood daemon once everything is fixed.
>>
>> Best regards,
>> Thomas LEIBOVICI
>> CEA/DAM
>>
>>  > A support request from lustre-discuss.
>>  >
>>  > ------------------------------------------------------------------------
>>  >
>>  > Subject:
>>  > [Lustre-discuss] robinhood error messages
>>  > From:
>>  > Thomas Roth <t.roth at gsi.de>
>>  > Date:
>>  > Tue, 23 Nov 2010 20:20:33 +0100
>>  > To:
>>  > lustre-discuss at lists.lustre.org
>>  >
>>  >
>>  > Hi all,
>>  >
>>  > we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to
>>  > find out where and who the big space consumers are - no purging).
>>  >
>>  > Robinhood sends me lots and lots of messages (~100/day) of the type
>>  >
>>  > > ===== FS scan is blocked (/lustre) =====
>>  > > Date: 2010/11/23 20:05:22
>>  > > Program: robinhood (pid 4826)
>>  > > Host: lxb310
>>  > > Filesystem: /lustre
>>  > > A thread has been inactive for 3660 sec
>>  > > while scanning directory /lustre/....
>>  >
>>  > This seems to indicate some trouble accessing certain directories on the
>>  > node where robinhood is running. However, this is independent of the
>>  > node, and at the same time we neither see any issues / slowness /
>>  > connectivity problems nor get any user complaints of the like.
>>  >
>>  > So I wonder whether anybody else is using robinhood and has seen similar
>>  > messages.
>>  >
>>  > Regards,
>>  > Thomas
>>  > _______________________________________________
>>  > Lustre-discuss mailing list
>>  > Lustre-discuss at lists.lustre.org
>>  > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>  >
>>  >
>>  > ------------------------------------------------------------------------
>>  > _______________________________________________
>>  > robinhood-support mailing list
>>  > robinhood-support at lists.sourceforge.net
>>  > https://lists.sourceforge.net/lists/listinfo/robinhood-support
>>  >
>>
>
>



