[Lustre-discuss] OSS crash
Denise Hummel
denise_hummel at nrel.gov
Thu Dec 4 10:34:09 PST 2008
Hi Andreas;
I thought there might be filesystem corruption; however, when I run
e2fsck, no issues are reported.
The system is now crashing about every hour. I did get the messages on
the console and am including them in this message. I am not great at
deciphering the messages, but it looks like a storage problem. Let
me know what you think. I have quite a few scientists impatiently
waiting to get back on the system. Thanks!
<ffffffff8032105b>{scheduled_timeout+375}
<ffffffffa048f31d>{:ost:ost_brw_write+9885}
Will spare you the hex and give the messages - let me know if you need
it.
{:ost:ost_brw_read+8528} {default_wake_function+0}
{:ptlrpc:lustre+msg_check_version+69}
{:ost:ost_bulk_timeout+0} {:ost:ost_handle+12187}
{lnet:lnet_match_block_msg+920}
{ptlrpc:ptlrpc_server_handle_request+2830}
{libcfs:lcw_update_time+30} {__mod_timer+293}
{ptlrpc:ptlrpc_main+2456} {default_wake_function+0}
{ptlrpc:ptlrpc_retry_rqbds+0} {ptlrpc:ptlrpc_retry_rqbds+0}
{child_rip+8} {ptlrpc:ptlrpc_main+0}
{child_rip+0}
Code: 0f 0b 04 6b 3d a0 ff ff ff ff 36 05 48 8b 43 20 66 44 29 58
RIP ldisk:ldiskfs_mb_use_best_found+256 RSP
<0> Kernel panic - not syncing Oops
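
For anyone reading a hand-copied trace like the one above, a rough way to see which modules are involved is to pull out the {module:symbol+offset} entries and count them. The following is only a minimal sketch in Python, not a Lustre tool: the names FRAME and parse_frames and the "kernel" label for entries without a module prefix are hypothetical, and the sample text is a few lines copied from the trace above.

```python
import re
from collections import Counter

# Matches entries such as {:ost:ost_handle+12187} or {__mod_timer+293}.
# The "module:" prefix is optional; entries without it are assumed here
# to be core kernel symbols (an assumption made for this sketch).
FRAME = re.compile(r"\{:?(?:(\w+):)?(\w+)\+(\d+)\}")

def parse_frames(text):
    """Return a list of (module, symbol, offset) tuples found in text."""
    frames = []
    for module, symbol, offset in FRAME.findall(text):
        frames.append((module or "kernel", symbol, int(offset)))
    return frames

# A few lines copied from the console output above.
trace = """
{:ost:ost_bulk_timeout+0} {:ost:ost_handle+12187}
{ptlrpc:ptlrpc_server_handle_request+2830}
{libcfs:lcw_update_time+30} {__mod_timer+293}
"""

frames = parse_frames(trace)
modules = Counter(module for module, _, _ in frames)

for module, count in modules.most_common():
    print(f"{module}: {count} frame(s)")
```

This only summarises what was typed in; it does not replace the full console output with the hex addresses.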
On Dec 03, 2008 19:30 -0700, Hummel, Denise wrote:
> We have a lustre filesystem that has been pretty stable since June 2008 on
> a 200 node cluster until three weeks ago. The OSS kernel panic has
> escalated since then to now about every 2 hours.
> The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%
Running with OSTs > 8TB exposes you to filesystem corruption.
Cheers, Andreas
--
Andreas Dilger