[Lustre-discuss] MDS crashes daily at the same hour

David Cohen cdavid at physics.technion.ac.il
Wed Jan 6 01:25:40 PST 2010


On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
> On 2010-01-04, at 03:02, David Cohen wrote:
> > I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a
> > problem with qlogic drivers and rolled back to 1.6.6).
> > My MDS gets unresponsive each day at 4-5 am local time, no kernel
> > panic or error messages before.

It was indeed the *locate update. A simple edit of /etc/updatedb.conf on the
clients and the system is stable again.
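
A minimal sketch of what such an edit might look like, assuming mlocate's
updatedb.conf syntax, where PRUNEFS lists filesystem types that updatedb
skips. The exact change made on the clients is not given in this thread; the
helper name add_prunefs and the sample file contents below are hypothetical.

```python
import re


def add_prunefs(conf_text: str, fstype: str = "lustre") -> str:
    """Return conf_text with fstype added to the PRUNEFS line.

    Assumes the mlocate form: PRUNEFS = "type1 type2 ...".
    """
    pattern = re.compile(r'^(\s*PRUNEFS\s*=\s*")([^"]*)(")', re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        # No PRUNEFS line yet: append one at the end of the file.
        sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
        return f'{conf_text}{sep}PRUNEFS = "{fstype}"\n'
    existing = match.group(2).split()
    if fstype.lower() in (name.lower() for name in existing):
        return conf_text  # already excluded, nothing to change
    new_value = " ".join(existing + [fstype])
    # Replace only the quoted value, leaving the rest of the file untouched.
    return conf_text[:match.start(2)] + new_value + conf_text[match.end(2):]


# Hypothetical sample contents, not taken from the clients in this thread.
sample = (
    'PRUNE_BIND_MOUNTS = "yes"\n'
    'PRUNEFS = "nfs NFS proc"\n'
    'PRUNEPATHS = "/tmp /var/spool"\n'
)
print(add_prunefs(sample))
```

The function only transforms text, so the result can be reviewed before it
is written back to /etc/updatedb.conf on each client. Adding the Lustre mount
point to PRUNEPATHS would be an alternative way to get the same effect.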
Many Thanks.


> 
> Judging by the time, I'd guess this is "slocate" or "mlocate" running
> on all of your clients at the same time.  This used to be a source of
> extremely high load back in the old days, but I thought that Lustre
> was in the exclude list in newer versions of *locate.  Looking at the
> installed mlocate on my system, that doesn't seem to be the case...
> strange.
> 
> > Some errors and an LBUG appear in the log after force booting the
> > MDS and mounting the MDT and then the log is clear until next morning:
> >
> > Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
> > Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
> > Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
> > Jan  4 06:33:31 tech-mds kernel: ll_mgs_02     R  running task       0  6357      1                6340 (L-TLB)
> > Jan  4 06:33:31 tech-mds kernel: Call Trace:
> > Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
> > Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
> > Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
> > Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
> > Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
> > Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
> > Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
> > Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
> 
> It shouldn't LBUG during recovery, however.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

-- 
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology


