[Lustre-discuss] OSS crash

Thu Dec 4 11:30:16 PST 2008

On Dec 04, 2008  11:34 -0700, Denise Hummel wrote:
> I thought there might be filesystem corruption, however when I run
> e2fsck there are no issues reported.
> The system is now crashing about every hour.  I did get the messages on
> the console and am including them in this message.  I am not great at
> deciphering the messages, however it looks like a storage problem.  Let
> me know what you think.  I have quite a few scientists impatiently
> waiting to get back on the system.  Thanks!
> 
> {child_rip+0}
> Code: 0f 0b 04 6b 3d a0 ff ff ff ff 36 05 48 8b 43 20 66 44 29 58
> RIP ldisk:ldiskfs_mb_use_best_found+256 RSP

A quick search for ldiskfs_mb_use_best_found in bugzilla reveals bug 16101.

This indicates that the kernel is trying to use space beyond 8TB and is
hitting a bug.  We fixed that bug in 1.6.6, but we are aware of at least
one other problem, which we found when testing OSTs with 14TB filesystems
(bug 17530).

One option is to download the latest e2fsprogs (1.41.3 I think) from
sourceforge and use it to shrink the OST filesystem to 8TB, which avoids
this problem until Lustre supports OSTs > 8TB.
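
As a rough sanity check on the numbers before shrinking, the arithmetic can
be sketched as below.  This is only an illustration under assumptions that
are not in the report: a 4096-byte block size, "TB" read as TiB, and helper
names (blocks_for_size, fits_after_shrink) that are made up for this example.
Check the real block size and usage on the OST itself.

```python
# Sketch only: sizes and block size are assumptions, not measured values.
TIB = 1024 ** 4


def blocks_for_size(size_bytes, block_size=4096):
    """Number of whole filesystem blocks that fit in size_bytes."""
    return size_bytes // block_size


def fits_after_shrink(total_bytes, used_fraction, target_bytes):
    """True if the space currently in use would fit in target_bytes."""
    return total_bytes * used_fraction <= target_bytes


limit_bytes = 8 * TIB                        # the 8TB limit discussed above
limit_blocks = blocks_for_size(limit_bytes)  # 2**31 blocks at 4096 bytes each
ost_bytes = 9 * TIB                          # "~9TB" from the report (assumed TiB)
used_fraction = 0.58                         # "at 58%" from the report

print("blocks at the 8TB limit:", limit_blocks)
print("OST larger than limit:  ", ost_bytes > limit_bytes)
print("used data fits in 8TB:  ",
      fits_after_shrink(ost_bytes, used_fraction, limit_bytes))
```

With these assumed figures the data in use (about 5.2TB) is well under 8TB,
so a shrink looks feasible on paper, but that depends on the actual usage at
the time.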

> On Dec 03, 2008  19:30 -0700, Hummel, Denise wrote:
> > We have a lustre filesystem that has been pretty stable since June 2008 on
> > a 200 node cluster until three weeks ago.  The OSS kernel panic has
> > escalated since then to now about every 2 hours.
> > The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> > The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> > One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%
> 
> Running with OSTs > 8TB exposes you to filesystem corruption.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.