[Lustre-discuss] OSS crash

Thu Dec 4 10:34:09 PST 2008

Hi Andreas;

I thought there might be filesystem corruption; however, when I run
e2fsck no issues are reported.
The system is now crashing about every hour.  I did get the messages on
the console and am including them in this message.  I am not great at
deciphering them, but it looks like a storage problem.  Let
me know what you think.  I have quite a few scientists impatiently
waiting to get back on the system.  Thanks!

<ffffffff8032105b>{scheduled_timeout+375} <ffffffffa048f31d>{:ost:ost_brw_write+9885}
I will spare you the hex and give just the messages - let me know if you
need it.
{:ost:ost_brw_read+8528} {default_wake_function+0}
{:ptlrpc:lustre+msg_check_version+69} 
{:ost:ost_bulk_timeout+0} {:ost:ost_handle+12187}
{lnet:lnet_match_block_msg+920}
{ptlrpc:ptlrpc_server_handle_request+2830}
{libcfs:lcw_update_time+30}  {__mod_timer+293}
{ptlrpc:ptlrpc_main+2456} {default_wake_function+0}
{ptlrpc:ptlrpc_retry_rqbds+0}  {ptlrpc:ptlrpc_retry_rqbds+0}
{child_rip+8}  {ptlrpc:ptlrpc_main+0}
{child_rip+0}
Code: 0f 0b 04 6b 3d a0 ff ff ff ff 36 05 48 8b 43 20 66 44 29 58
RIP ldisk:ldiskfs_mb_use_best_found+256 RSP
<0> Kernel panic - not syncing Oops
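
For anyone else who finds these frames hard to read, a minimal sketch of splitting a trace into module, symbol and offset may help. It assumes every frame has the {module:symbol+offset} shape seen in the trace above, with the module optional. The names FRAME_RE and parse_frames are hypothetical helpers, not part of Lustre or any kernel tool.

```python
import re

# Matches frames such as {:ost:ost_brw_read+8528} or {default_wake_function+0}.
# The module prefix is optional and may or may not carry a leading colon.
# Whitespace before the "+offset" part is tolerated, to cope with mail line wraps.
FRAME_RE = re.compile(
    r"\{:?(?:(?P<module>\w+):)?(?P<symbol>[^{}:\s]+)\s*\+(?P<offset>\d+)\}"
)


def parse_frames(trace):
    """Return (module, symbol, offset) tuples in the order they appear.

    module is None when the frame names no module (assumed to be a core
    kernel symbol).
    """
    frames = []
    for m in FRAME_RE.finditer(trace):
        frames.append(
            (m.group("module"), m.group("symbol"), int(m.group("offset")))
        )
    return frames


if __name__ == "__main__":
    sample = "{:ost:ost_brw_read+8528} {default_wake_function+0}"
    for module, symbol, offset in parse_frames(sample):
        print(f"{module or 'kernel':8s} {symbol} (+{offset})")
```

Grouping the frames by module this way only shows which layers (ost, ptlrpc, lnet, and so on) appear in the call path. It does not by itself identify the cause of the panic.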

On Dec 03, 2008  19:30 -0700, Hummel, Denise wrote:
> We have a lustre filesystem that has been pretty stable since June 2008 on
> a 200 node cluster until three weeks ago.  The OSS kernel panic has
> escalated since then to now about every 2 hours.
> The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%

Running with OSTs > 8TB exposes you to filesystem corruption.

Cheers, Andreas
--
Andreas Dilger