[Lustre-discuss] MDS crashes daily at the same hour

Andreas Dilger adilger at sun.com
Fri Jan 22 03:32:29 PST 2010


On 2010-01-06, at 04:25, David Cohen wrote:
> On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
>> On 2010-01-04, at 03:02, David Cohen wrote:
>>> I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a
>>> problem with qlogic drivers and rolled back to 1.6.6).
>>> My MDS gets unresponsive each day at 4-5 am local time, with no kernel
>>> panic or error messages beforehand.
>
> It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the
> clients and the system is stable again.

I asked the upstream Fedora/RHEL maintainer of mlocate to add "lustre"  
to the exception list in updatedb.conf, and he has already done so for  
Fedora.  There is also a bug filed for RHEL5 to do the same, if anyone  
is interested in following it:

https://bugzilla.redhat.com/show_bug.cgi?id=557712
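
For anyone making the change by hand in the meantime, here is a minimal
sketch in Python (not from the original thread). It assumes the mlocate-style
/etc/updatedb.conf format, where PRUNEFS is a double-quoted,
whitespace-separated list of filesystem types; the function name
add_prunefs is hypothetical, and the matching is done case-insensitively
on the assumption that updatedb treats the types that way.

```python
import re

# Matches a line such as:  PRUNEFS = "nfs NFS proc"
# (assumed mlocate-style syntax: double-quoted, whitespace-separated list)
_PRUNEFS_RE = re.compile(r'^(\s*PRUNEFS\s*=\s*")([^"]*)(")', re.MULTILINE)


def add_prunefs(conf_text, fstype="lustre"):
    """Return conf_text with fstype present in the PRUNEFS list.

    The text is returned unchanged if fstype is already listed.
    """
    match = _PRUNEFS_RE.search(conf_text)
    if match is None:
        # No PRUNEFS line at all: append one at the end of the file.
        sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
        return conf_text + sep + 'PRUNEFS = "%s"\n' % fstype

    entries = match.group(2).split()
    if fstype.lower() in (entry.lower() for entry in entries):
        return conf_text  # already excluded, nothing to do

    entries.append(fstype)
    # Rewrite only the quoted list, leaving the rest of the file as it was.
    return (conf_text[:match.start(2)]
            + " ".join(entries)
            + conf_text[match.end(2):])


if __name__ == "__main__":
    sample = 'PRUNE_BIND_MOUNTS = "yes"\nPRUNEFS = "nfs NFS proc"\n'
    print(add_prunefs(sample), end="")
```

The function only transforms text, so it can be tried on a copy of the file
before anything is written back on the clients.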

>> Judging by the time, I'd guess this is "slocate" or "mlocate" running
>> on all of your clients at the same time.  This used to be a source of
>> extremely high load back in the old days, but I thought that Lustre
>> was in the exclude list in newer versions of *locate.  Looking at the
>> installed mlocate on my system, that doesn't seem to be the case...
>> strange.
>>
>>> Some errors and an LBUG appear in the log after force booting the MDS and
>>> mounting the MDT, and then the log is clear until the next morning:
>>>
>>> Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
>>> Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
>>> Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
>>> Jan  4 06:33:31 tech-mds kernel: ll_mgs_02     R  running task 0  6357 1                6340 (L-TLB)
>>> Jan  4 06:33:31 tech-mds kernel: Call Trace:
>>> Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
>>> Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
>>> Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
>>> Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
>>> Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
>>> Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
>>> Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
>>> Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>>
>> It shouldn't LBUG during recovery, however.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>
> -- 
> David Cohen
> Grid Computing
> Physics Department
> Technion - Israel Institute of Technology


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
