[Lustre-discuss] [RESOLVED] Strange MDS Problem + Resolution

Johann Lombardi johann at sun.com
Tue Sep 29 00:17:54 PDT 2009


On Sep 28, 2009, at 12:46 AM, Aaron Knister wrote:
> I wanted to post this here so that anybody else who stumbles across
> this problem doesn't spend hours banging their head against a brick
> wall. I was helping with a Lustre disk setup that kept crashing. The
> Lustre filesystem would hang, and one thread (ll_mdt_[0-9]*) would be
> pegged at 100% CPU. It turns out there were some on-disk
> inconsistencies left behind when the MDS crashed after running out of
> memory. A simple fsck of the MDT fixed the issue, after many hours of
> attempted debugging. We didn't think the problem could be fixed by a
> simple fsck... but it makes sense.

Recent kernels have additional checks (in do_split(), but in other
places as well) to prevent this kind of problem (a crash or an infinite
loop when the layout is corrupted). I wonder whether they would catch
this case and return an error instead. Do you know where in do_split()
the process was stuck?

Cheers,
Johann


